Assignment: Integrative Genomics Practical Course: UVIC MS Omics Data Analysis-Interactomics

The aim of this practical is to study the impact of age in adrenocortical carcinoma (ACC). To that end, an analysis must be performed that integrates at least 2 different blocks of omics data, following these steps:

  1. Use the miniACC data in MultiAssayExperiment Bioconductor’s package
  2. Use three omics data blocks: transcriptomics, plus two of the following: miRNA, protein, mutation or copy number (CN)
  3. Study the distribution of variable age (variable years_to_birth) on the common samples.
  4. Identify the aged and young patients. You can set a threshold or use the tails of the age variable distribution, for instance. Each group must have at least 5 samples. Explain this selection and make sure that selected patients have the same tumoral samples in the omics data types selected.
  5. Study the blocks of omics data you have chosen (perform distribution plots, check categorical copy numbers or logR, etc.) and transform them and/or filter them before performing the multiomics analysis. Explain all steps you perform.
  6. Apply MFA or mixOmics DIABLO to see differences between aged and young patients. Keep in mind that you will require several steps for preparing the data by matching cases in the databases, selecting necessary variables, transposing, etc. All these steps need to be clearly described.
  7. Choose a relevant graphic or two to include in the report (do not include all of them!).
  8. Draw some conclusions

SUMMARY OF RESULTS AND CONCLUSIONS

A MultiAssayExperiment object, miniACC, was evaluated to study the impact of age in adrenocortical carcinoma (ACC). Ten patients (5 old and 5 young) were selected from the upper and lower tails of the distribution of the age variable (years_to_birth), which was examined graphically and judged approximately normal, using the samples common to three individual (Ranged)SummarizedExperiments: mRNA-seq, miRNA-seq, and gene-based GISTIC CNV recurrent lesions. These three data blocks were then filtered, TPM-normalized, log-transformed, and scaled. They were analyzed individually, then correlated with one another, and finally evaluated jointly via Multiple Factor Analysis (MFA) and other means.

Regarding mRNA-seq Summarized Experiment:

The ward.D2 hierarchical clustering appeared to reflect the segregation of the 5 old and 5 young patients. The log2-ratios of the TPM-normalized mRNA-seq count values appeared to be normally distributed. PCA revealed no apparent segregation by age status; the first 2 principal components, PC1 and PC2, accounted for 25.07% + 19.65% = 44.72% of the variance. A heatmap showed that old patients underexpressed relatively more mRNA genes. Based on pooled results from limma/voom/edgeR and DESeq2 modeling, the genes deemed differentially expressed with respect to old/young age status included AKT1S1, ASNS, NRAS, MAPK9, ITGA2, ADAR, EGFR, FASN, SERPINE1, TSC2, YBX1, SHC1, TGM2, RAD50, PIK3R1, XBP1, SYK, and CDKN2A. From the DESeq2 model alone, 3 genes were statistically significantly overexpressed (ITGA2, TGM2, ASNS) and 2 were statistically significantly underexpressed (CDKN2A, NRAS). Based on BioMart-derived descriptions and Gene Ontology (GO) analysis, some of the significantly overexpressed protein-coding genes were associated with phagocytosis, endocytosis, and asparagine-glutamine metabolic processes. Chromosomes 1, 5, and 7 had the most differentially expressed genes. TGM2 is overexpressed in old patients and underexpressed in young patients, while CDKN2A is overexpressed in young patients. CDKN2A is aberrantly downregulated in old patient A5LC.
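
The percentages of variance and the clustering check reported above can be reproduced in outline with base R. The following is a minimal sketch on simulated data, not the miniACC expression block; the matrix `expr`, the factor `age_status`, and the sample labels are all illustrative assumptions.

```r
# Simulated stand-in for a genes x samples log-expression matrix (NOT the miniACC data)
set.seed(1)
expr <- matrix(rnorm(50 * 10), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:10)))
age_status <- factor(rep(c("young", "old"), each = 5))

# PCA on samples: transpose so rows are samples, then centre and scale each gene
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)   # percent of variance for PC1 and PC2
sum(var_explained[1:2])        # cumulative percent for the first two PCs

# Ward clustering of samples on Euclidean distances, cut into two groups
hc <- hclust(dist(t(expr)), method = "ward.D2")
clusters <- cutree(hc, k = 2)
table(clusters, age_status)    # cross-tabulate clusters against age group
```

Cross-tabulating the two clusters against the age groups gives a simple, non-visual way to judge whether the dendrogram "reflects" the old/young split.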

Regarding miRNA-seq Summarized Experiment data block:

The ward.D2 hierarchical clustering did not appear to reflect the segregation of the 5 old and 5 young patients. PCA revealed differences between the young and old patient samples (in dim 1 and dim 2), with the exception of young patients A5J9 and A5JI and old patient A5LC; the first 2 dimensions captured 28.27% + 18.86% = 47.13% of the total variance. Based on pooled results from limma/voom/edgeR and DESeq2 modeling, the miRNA genes deemed differentially expressed with respect to old/young age status included hsa-mir-153-2, hsa-mir-153-1, hsa-mir-541, hsa-mir-412, hsa-mir-3200, hsa-mir-675, hsa-mir-1248, hsa-mir-9-2, hsa-mir-9-1, hsa-mir-1229, hsa-mir-511-1, hsa-mir-507, hsa-mir-107, hsa-mir-148b, hsa-mir-542, hsa-mir-98, hsa-mir-887, and hsa-mir-9-3. Specifically, based on the heatmap, miRNA gene hsa-mir-511-1 was overexpressed in old patients A5LL and A5L5 and underexpressed in young patients A5LE, A5J9, and A5KV. On the other hand, miRNA gene hsa-mir-675 was underexpressed in young patients A5J9, A5JI, A5K0, A5JE, and A5KV and overexpressed in A5LL and A5JF, and slightly in A5LC and A5L5. Based on NCBI and BioMart-derived data, hsa-mir-1229 and hsa-mir-675 are located on chromosomes 5 (5q35.3) and 11, respectively. Gene hsa-mir-511-1 is situated on chromosome 10 at 17845107..17845193. Chromosomes X and 5 have the most (3) significantly differentially expressed miRNA genes. Based on Gene Ontology (GO) analysis, the statistically differentially expressed miRNA genes are associated with regulation of phosphorus metabolism. The TargetScan and multiMiR approaches were both used to determine the mRNA gene targets of these identified miRNA genes. Of all the miRNA genes in the differentially expressed list, only 3 were successfully queried with get_multimir to identify their mRNA targets. Of all identified targets of these 3, only the CDKN1A target of hsa-miR-1248 and the SERBP1 target of hsa-miR-107 appear distantly related (by gene symbol similarity) to the RNA-seq differentially expressed genes CDKN2A and SERPINE1. Using the TargetScan approach, the expression of miRNA gene hsa-let-7i was found to be significantly correlated with the expression of protein-coding genes CASP3 and GAB2. Unfortunately, the function-based automatic conversion of the miRNA-seq SummarizedExperiment to a RangedSummarizedExperiment split it into ranged and unranged sets, so the converted object was not used subsequently.

Regarding GISTIC CNV Summarized Experiment:

The ward.D2 hierarchical clustering did not appear to reflect the segregation of the 5 old and 5 young patients. Based on PCA, there did not appear to be segregation by age status for the gene-based GISTIC recurrent region state values; the first 2 principal components, PC1 and PC2, accounted for 21.26% + 29.4% = 50.66% of the variance. A simple linear regression was fitted for each gene, with the gene's GISTIC CNV value as the dependent variable and the categorical age status (young/old) as the independent variable; the gene with the lowest p-value for a difference in GISTIC CNV value between young and old patients was FOXO3. The readGistic function was explored to read in files provided manually after obtaining them via TCGAutils, or from a directory containing GISTIC results, and to import all the relevant files. However, we were not successful at obtaining the required “all-lesions_CV.txt” file, although we were able to depict GISTIC peak regions graphically via the associated plotting functions. Furthermore, the associated ACC “CNV INdividual Calls” SummarizedExperiment with its assay matrix was successfully downloaded via query from TCGA and added to our original miniACC MultiAssayExperiment object, but its samples and patients could not be matched to those of the other data blocks. Therefore, the CNVRanger package and associated functions were used to further analyze our gene-wise GISTIC CNV recurrent lesions SummarizedExperiment, under the assumption that the GISTIC data represented original “individual calls” that were subsequently converted to GISTIC summarized population-level recurrent gene-based lesion regions. Of the 197 CNV regions (cnvrs object), 33 overlapped with at least one gene, and the resulting CNVRanger permutation test p-value indicated a significant depletion of such overlaps. The findOverlaps function from the GenomicRanges package, a general function for finding overlaps between two sets of genomic regions, was used to find the protein-coding genes overlapping the aforementioned 33 summarized CNV regions.
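
The per-gene regression described above can be sketched as follows. This is an illustration on simulated GISTIC-like values, not the code used for the report; the objects `cn` and `age_status` and the choice of Benjamini-Hochberg adjustment are assumptions added here.

```r
# Simulated genes x samples matrix of GISTIC-like states in -2..2 (NOT the miniACC data)
set.seed(2)
cn <- matrix(sample(-2:2, 20 * 10, replace = TRUE), nrow = 20,
             dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
age_status <- factor(rep(c("young", "old"), each = 5), levels = c("young", "old"))

# One simple linear regression per gene: copy-number state ~ age group
pvals <- apply(cn, 1, function(y) {
  fit <- lm(y ~ age_status)
  summary(fit)$coefficients["age_statusold", "Pr(>|t|)"]
})

# Adjust for testing many genes, then list the genes with the smallest p-values
padj <- p.adjust(pvals, method = "BH")
head(sort(pvals), 3)
```

Because GISTIC states are ordinal and only 10 samples are available, the raw p-values from such a model are best read as a ranking device; reporting the adjusted values alongside the top gene helps show how strong the evidence actually is.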

Correlation between CNV and mRNA-seq Data Blocks:

Differential expression of genes in the neighborhood of CNV regions of interest #1, 2, 3, 4, 8, 9, 13, 16, 23, 34, and 35 was illustrated via the CNVRanger function plotEQTL. Furthermore, when correlating raw (unfiltered, non-normalized, non-transformed) mRNA-seq and GISTIC CNV assay data, the following 12 genes were found to be strongly correlated across all patients (young and old combined): “ATM”, “ACVRL1”, “TSC1”, “GSK3A”, “KEAP1”, “XRCC1”, “NFKB1”, “NF2”, “MYH9”, “YWHAB”, “MSH2”, and “DIABLO”. In addition, 44 and 50 genes were significantly correlated across the 5 selected old and the 5 selected young patients, respectively. MFA on the CNV and mRNA-seq blocks only, using data that had been filtered, TPM-normalized, log-transformed, and scaled, showed segregation between old and young patients. In this MFA, the first dimension received its highest contributions from mRNA-seq gene expression (SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, SQSTM1) and the second dimension received its highest contributions from GISTIC CNV gene copy number variation (SRC, TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, ERBB3).
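
A gene-by-gene correlation between expression and copy number, as described above, can be sketched like this. The data are simulated and the names `expr`, `cn`, and `res` are illustrative; Spearman correlation is an assumption chosen here because GISTIC states are ordinal, and the report does not state which coefficient was used.

```r
# Simulated matched matrices (genes x samples); NOT the miniACC data
set.seed(3)
genes   <- paste0("gene", 1:15)
samples <- paste0("s", 1:10)
cn   <- matrix(sample(-2:2, 150, replace = TRUE), nrow = 15,
               dimnames = list(genes, samples))
expr <- 0.8 * cn + matrix(rnorm(150), nrow = 15)  # expression loosely tracks copy number

# Both blocks must have identical gene and sample order before correlating
stopifnot(identical(dimnames(expr), dimnames(cn)))

# Spearman correlation per gene (ties make exact p-values unavailable, hence suppressWarnings)
res <- t(sapply(genes, function(g) {
  ct <- suppressWarnings(cor.test(expr[g, ], cn[g, ], method = "spearman"))
  c(rho = unname(ct$estimate), p = ct$p.value)
}))
res <- as.data.frame(res)
res$padj <- p.adjust(res$p, method = "BH")
res[order(res$p), ][1:3, ]                        # three most strongly correlated genes
```

The explicit check on `dimnames` matters in practice: the assays carry different sample barcodes, so columns must be matched by patient before any per-gene correlation is meaningful.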

Correlation between mRNA-seq and miRNA-seq Data Blocks:

The expression of miRNA gene hsa-let-7i was found to be significantly correlated with expression of protein-coding genes CASP3 and GAB2.

MFA between mRNA-seq, miRNA-seq, and GISTIC CNV Data Blocks

The filtered, TPM-normalized, log-transformed, scaled data of all 3 data block SummarizedExperiments (mRNA-seq, miRNA-seq, and GISTIC CNV) were jointly evaluated via MFA. Multiple Factor Analysis (MFA) helps elucidate the underlying structure of the data by reducing its dimensionality and highlighting the relationships between variables and observations. Based on the MFA summary eigenvalues, the first three dimensions capture 57.77% (24.66% (dim1) + 18.85% (dim2) + 14.268% (dim3)) of the total variance. Based on the MFA summary group analysis, the miRNA-seq and mRNA-seq variables jointly contribute the most to the first dimension, while the GISTIC CNV recurrent lesions contribute the most to dimension #2 (0.9 vs. 0.009). The top genes impacting dimension #1 (from the mRNA-seq data block) are SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, and SQSTM1. The top genes impacting dimension #2 (from the GISTIC CNV gene-based recurrent lesions data block) are SRC, TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, and ERBB3. The top genes impacting dimension #3 are (from the miRNA-seq data block) hsa.mir.196a.2, hsa.mir.106b, hsa.mir.196a.1, hsa.mir.25, hsa.mir.16.2, hsa.mir.196b, and hsa.mir.92a.2, and (from the mRNA-seq data block) CDK1, FOXM1, and ACACB. The MFA shows clear separation between the CNV, mRNA, and miRNA data blocks. In the individuals analysis, which examines how each sample relates to each dimension, the positions of the ten individuals in the multidimensional space show no clear segregation between young and old patient samples. Of the ten selected patient samples, A5J9 (young), A5JF (old), A5JI (young), A5K0 (old), A5L5 (old), and A5LL (old) have positive coordinates on dimension #1, while A5JE (young), A5KV (young), A5LC (old), and A5LE (young) have negative coordinates on dimension #1. Young patients TCGA.OR.A5LE, A5J9, and A5JE appear to be outliers, as do old patients A5K0, A5LL, A5JF, and A5LC, suggesting that the 10 selected patients may not have been well suited to this integrative genomics study. The mRNA expression block seems to coincide with the age.status condition more than the other 2 data blocks. Based on the MFA continuous variables analysis, which indicates the relationship between the original variables and the extracted dimensions, the mRNA-seq quantitative variables contributed the most to dimension #1 compared with the miRNA-seq and GISTIC CNV recurrent lesions variables.
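
To make the MFA quantities above (percent of variance per dimension, block contributions) more concrete, the following base-R sketch illustrates the block-weighting idea behind MFA: each block is standardized and divided by its first singular value, and a global PCA is then run on the concatenated blocks. It uses simulated data and is not the package call used in the report; the block sizes and all object names are assumptions.

```r
# Three simulated blocks measured on the same 10 samples (rows); NOT the miniACC data
set.seed(4)
n <- 10
blocks <- list(
  mRNA  = matrix(rnorm(n * 30), nrow = n),
  miRNA = matrix(rnorm(n * 40), nrow = n),
  CNV   = matrix(rnorm(n * 20), nrow = n)
)

# Step 1: standardise each variable, then divide the block by its first singular value
weight_block <- function(x) {
  z <- scale(x)            # centre and scale columns
  z / svd(z)$d[1]          # the block's first singular value becomes 1
}
weighted <- lapply(blocks, weight_block)

# Step 2: global PCA (via SVD) on the column-bound weighted blocks
global <- svd(do.call(cbind, weighted))
eig <- global$d^2
pct <- 100 * eig / sum(eig)
round(pct[1:3], 2)                       # percent of inertia on dimensions 1 to 3
scores <- global$u %*% diag(global$d)    # sample coordinates, one row per sample

# Step 3: contribution of each block to dimension 1 (squared loadings summed by block)
block_id <- rep(names(blocks), times = sapply(blocks, ncol))
contrib_dim1 <- 100 * tapply(global$v[, 1]^2, block_id, sum)
round(contrib_dim1, 1)
```

The weighting step is what prevents the block with the most variables (here the simulated miRNA block) from dominating the first dimension purely because of its size.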

EXPLORATION OF MULTIASSAY EXPERIMENT, SELECTION OF 10 YOUNG/OLD PATIENTS, EQUALIZATION OF PATIENTS/SAMPLES, AND SEPARATION OF SUMMARIZED EXPERIMENTS

#EXPLORE miniACC MultiAssayExperiment:
library(MultiAssayExperiment) #provides the miniACC dataset and upsetSamples()
data(miniACC)
class(miniACC)
## [1] "MultiAssayExperiment"
## attr(,"package")
## [1] "MultiAssayExperiment"
miniACC
## A MultiAssayExperiment object of 5 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 5:
##  [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
##  [2] gistict: SummarizedExperiment with 198 rows and 90 columns
##  [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
##  [4] Mutations: matrix with 97 rows and 90 columns
##  [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files
#RNASeq2GeneNorm
#RNA-seq count data: a SummarizedExperiment with 198 rows and 79 columns
#gistict
#Recurrent copy number lesions identified by GISTIC2: a SummarizedExperiment with 198 rows and 90 columns
#RPPAArray
#Reverse Phase Protein Array: a SummarizedExperiment with 33 rows and 46 columns. Rows are indexed by genes, but protein annotations are available from featureData(miniACC[["RPPAArray"]]). The source of these annotations is noted in abstract(miniACC[["RPPAArray"]])
#Mutations
#Somatic mutations: a matrix with 97 rows and 90 columns. 1 for any kind of non-silent mutation, zero for silent (synonymous) or no mutation.
#miRNASeqGene
#microRNA sequencing: a SummarizedExperiment with 471 rows and 80 columns. Rows not having at least 5 counts in at least 5 samples were removed.

#This dataset provides five assays on 92 patients, although not all five assays were performed for every patient:
upsetSamples(miniACC)

#This graph depicts the overlapping patients for all 5 assays

colData(miniACC)
## DataFrame with 92 rows and 30 columns
##                 patientID years_to_birth vital_status days_to_death
##               <character>      <integer>    <integer>     <integer>
## TCGA-OR-A5J1 TCGA-OR-A5J1             58            1          1355
## TCGA-OR-A5J2 TCGA-OR-A5J2             44            1          1677
## TCGA-OR-A5J3 TCGA-OR-A5J3             23            0            NA
## TCGA-OR-A5J4 TCGA-OR-A5J4             23            1           423
## TCGA-OR-A5J5 TCGA-OR-A5J5             30            1           365
## ...                   ...            ...          ...           ...
## TCGA-PK-A5H9 TCGA-PK-A5H9             27            0            NA
## TCGA-PK-A5HA TCGA-PK-A5HA             63            0            NA
## TCGA-PK-A5HB TCGA-PK-A5HB             63            0            NA
## TCGA-PK-A5HC TCGA-PK-A5HC             44            0            NA
## TCGA-P6-A5OG TCGA-P6-A5OG             45            1           383
##              days_to_last_followup tumor_tissue_site pathologic_stage
##                          <integer>       <character>      <character>
## TCGA-OR-A5J1                    NA           adrenal         stage ii
## TCGA-OR-A5J2                    NA           adrenal         stage iv
## TCGA-OR-A5J3                  2091           adrenal        stage iii
## TCGA-OR-A5J4                    NA           adrenal         stage iv
## TCGA-OR-A5J5                    NA           adrenal        stage iii
## ...                            ...               ...              ...
## TCGA-PK-A5H9                   616           adrenal         stage ii
## TCGA-PK-A5HA                  1201           adrenal          stage i
## TCGA-PK-A5HB                  1293           adrenal               NA
## TCGA-PK-A5HC                   679           adrenal        stage iii
## TCGA-P6-A5OG                    NA           adrenal         stage iv
##              pathology_T_stage pathology_N_stage      gender
##                    <character>       <character> <character>
## TCGA-OR-A5J1                t2                n0        male
## TCGA-OR-A5J2                t3                n0      female
## TCGA-OR-A5J3                t3                n0      female
## TCGA-OR-A5J4                t3                n1      female
## TCGA-OR-A5J5                t4                n0        male
## ...                        ...               ...         ...
## TCGA-PK-A5H9                t2                n0      female
## TCGA-PK-A5HA                t1                n0        male
## TCGA-PK-A5HB                NA                NA        male
## TCGA-PK-A5HC                t4                n0      female
## TCGA-P6-A5OG                t4                n0      female
##              date_of_initial_pathologic_diagnosis radiation_therapy
##                                         <integer>       <character>
## TCGA-OR-A5J1                                 2000                no
## TCGA-OR-A5J2                                 2004                no
## TCGA-OR-A5J3                                 2008                no
## TCGA-OR-A5J4                                 2000                no
## TCGA-OR-A5J5                                 2000                no
## ...                                           ...               ...
## TCGA-PK-A5H9                                 2012                no
## TCGA-PK-A5HA                                 2011                no
## TCGA-PK-A5HB                                 2003               yes
## TCGA-PK-A5HC                                 2011                no
## TCGA-P6-A5OG                                 2011                no
##                   histological_type residual_tumor number_of_lymph_nodes
##                         <character>    <character>             <integer>
## TCGA-OR-A5J1 adrenocortical carci..             r0                    NA
## TCGA-OR-A5J2 adrenocortical carci..             r2                     0
## TCGA-OR-A5J3 adrenocortical carci..             r0                     0
## TCGA-OR-A5J4 adrenocortical carci..             r2                     2
## TCGA-OR-A5J5 adrenocortical carci..             r2                    NA
## ...                             ...            ...                   ...
## TCGA-PK-A5H9 adrenocortical carci..             r0                    NA
## TCGA-PK-A5HA adrenocortical carci..             r0                     0
## TCGA-PK-A5HB adrenocortical carci..             NA                    NA
## TCGA-PK-A5HC adrenocortical carci..             r1                     0
## TCGA-P6-A5OG adrenocortical carci..             r2                     0
##                     race              ethnicity   Histology     C1A.C1B
##              <character>            <character> <character> <character>
## TCGA-OR-A5J1       white                     NA  Usual Type         C1A
## TCGA-OR-A5J2       white     hispanic or latino  Usual Type         C1A
## TCGA-OR-A5J3       white     hispanic or latino  Usual Type         C1A
## TCGA-OR-A5J4       white     hispanic or latino  Usual Type          NA
## TCGA-OR-A5J5       white     hispanic or latino  Usual Type         C1A
## ...                  ...                    ...         ...         ...
## TCGA-PK-A5H9       asian not hispanic or latino  Usual Type         C1B
## TCGA-PK-A5HA          NA                     NA  Usual Type         C1B
## TCGA-PK-A5HB          NA                     NA  Usual Type         C1A
## TCGA-PK-A5HC       asian not hispanic or latino  Usual Type          NA
## TCGA-P6-A5OG       white not hispanic or latino          NA          NA
##                             mRNA_K4        MethyLevel miRNA.cluster
##                         <character>       <character>   <character>
## TCGA-OR-A5J1 steroid-phenotype-hi..         CIMP-high       miRNA_1
## TCGA-OR-A5J2 steroid-phenotype-hi..          CIMP-low       miRNA_1
## TCGA-OR-A5J3 steroid-phenotype-high CIMP-intermediate       miRNA_6
## TCGA-OR-A5J4                     NA         CIMP-high       miRNA_6
## TCGA-OR-A5J5 steroid-phenotype-high CIMP-intermediate       miRNA_2
## ...                             ...               ...           ...
## TCGA-PK-A5H9  steroid-phenotype-low          CIMP-low       miRNA_5
## TCGA-PK-A5HA  steroid-phenotype-low         CIMP-high       miRNA_5
## TCGA-PK-A5HB steroid-phenotype-high         CIMP-high       miRNA_6
## TCGA-PK-A5HC                     NA                NA            NA
## TCGA-P6-A5OG                     NA                NA            NA
##              SCNA.cluster protein.cluster         COC    OncoSign    purity
##               <character>       <integer> <character> <character> <numeric>
## TCGA-OR-A5J1        Quiet              NA        COC3         CN2      0.90
## TCGA-OR-A5J2        Noisy               1        COC3    TP53/NF1      0.89
## TCGA-OR-A5J3  Chromosomal               3        COC2         CN2      0.93
## TCGA-OR-A5J4  Chromosomal              NA          NA         CN1      0.87
## TCGA-OR-A5J5  Chromosomal              NA        COC2    TP53/NF1      0.93
## ...                   ...             ...         ...         ...       ...
## TCGA-PK-A5H9        Quiet               3        COC1    TP53/NF1      0.79
## TCGA-PK-A5HA  Chromosomal               2        COC1         CN2      0.83
## TCGA-PK-A5HB        Noisy              NA        COC3    TP53/NF1      0.88
## TCGA-PK-A5HC  Chromosomal              NA          NA    TP53/NF1      0.59
## TCGA-P6-A5OG           NA              NA          NA          NA        NA
##                 ploidy genome_doublings       ADS
##              <numeric>        <integer> <numeric>
## TCGA-OR-A5J1      1.95                0     -0.08
## TCGA-OR-A5J2      1.31                0     -0.84
## TCGA-OR-A5J3      1.25                0      1.18
## TCGA-OR-A5J4      2.60                1        NA
## TCGA-OR-A5J5      2.75                1     -1.00
## ...                ...              ...       ...
## TCGA-PK-A5H9      2.00                0     -0.85
## TCGA-PK-A5HA      1.69                0     -1.49
## TCGA-PK-A5HB      1.64                0     -0.31
## TCGA-PK-A5HC      2.53                1        NA
## TCGA-P6-A5OG        NA               NA        NA
#getClinicalNames(miniACC)

#Subset the MultiAssayExperiment to only include the three assays RNASeq2GeneNorm, gistict, and miRNASeqGene SummarizedExperiment
#multiassayexperiment[i = rownames, j = primary or colnames, k = assay]
miniACC.assays<-miniACC[, , c("RNASeq2GeneNorm", "gistict", "miRNASeqGene")]
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
##   removing 136 sampleMap rows not in names(experiments)
#complete.cases() shows which patients have complete data for all assays:
summary(complete.cases(miniACC.assays))
##    Mode   FALSE    TRUE 
## logical      15      77
#Subset MultiAssayExperiment to Obtain common samples
miniACC.assays.comp<-miniACC.assays[, complete.cases(miniACC.assays), ]
#Confirm that every remaining patient now has complete data for all three assays:
summary(complete.cases(miniACC.assays.comp))
##    Mode    TRUE 
## logical      77
colData(miniACC.assays.comp)$patientID
##  [1] "TCGA-OR-A5J1" "TCGA-OR-A5J2" "TCGA-OR-A5J3" "TCGA-OR-A5J5" "TCGA-OR-A5J6"
##  [6] "TCGA-OR-A5J7" "TCGA-OR-A5J8" "TCGA-OR-A5J9" "TCGA-OR-A5JA" "TCGA-OR-A5JB"
## [11] "TCGA-OR-A5JC" "TCGA-OR-A5JD" "TCGA-OR-A5JE" "TCGA-OR-A5JF" "TCGA-OR-A5JG"
## [16] "TCGA-OR-A5JI" "TCGA-OR-A5JJ" "TCGA-OR-A5JK" "TCGA-OR-A5JL" "TCGA-OR-A5JM"
## [21] "TCGA-OR-A5JO" "TCGA-OR-A5JP" "TCGA-OR-A5JQ" "TCGA-OR-A5JR" "TCGA-OR-A5JS"
## [26] "TCGA-OR-A5JT" "TCGA-OR-A5JV" "TCGA-OR-A5JW" "TCGA-OR-A5JX" "TCGA-OR-A5JY"
## [31] "TCGA-OR-A5JZ" "TCGA-OR-A5K0" "TCGA-OR-A5K1" "TCGA-OR-A5K2" "TCGA-OR-A5K3"
## [36] "TCGA-OR-A5K4" "TCGA-OR-A5K5" "TCGA-OR-A5K6" "TCGA-OR-A5K8" "TCGA-OR-A5K9"
## [41] "TCGA-OR-A5KO" "TCGA-OR-A5KT" "TCGA-OR-A5KU" "TCGA-OR-A5KV" "TCGA-OR-A5KW"
## [46] "TCGA-OR-A5KX" "TCGA-OR-A5KY" "TCGA-OR-A5KZ" "TCGA-OR-A5L3" "TCGA-OR-A5L4"
## [51] "TCGA-OR-A5L5" "TCGA-OR-A5L6" "TCGA-OR-A5L8" "TCGA-OR-A5L9" "TCGA-OR-A5LA"
## [56] "TCGA-OR-A5LB" "TCGA-OR-A5LC" "TCGA-OR-A5LD" "TCGA-OR-A5LE" "TCGA-OR-A5LG"
## [61] "TCGA-OR-A5LH" "TCGA-OR-A5LJ" "TCGA-OR-A5LK" "TCGA-OR-A5LL" "TCGA-OR-A5LM"
## [66] "TCGA-OR-A5LN" "TCGA-OR-A5LO" "TCGA-OR-A5LP" "TCGA-OR-A5LR" "TCGA-OR-A5LS"
## [71] "TCGA-OR-A5LT" "TCGA-OU-A5PI" "TCGA-PA-A5YG" "TCGA-PK-A5H9" "TCGA-PK-A5HA"
## [76] "TCGA-PK-A5HB" "TCGA-P6-A5OG"
#More simply, intersectColumns() will select complete cases and rearrange each ExperimentList element 
#so its columns correspond exactly to rows of colData in the same order:
#miniACC.assays.comp=intersectColumns(miniACC.assays)
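
#What "common samples" means can be shown on a toy example in base R. This is only a sketch of the logic, not the
#MultiAssayExperiment implementation; the patient labels P1..P6 and the list assay_patients are made up for illustration.

```r
# Hypothetical patients profiled in each assay (labels are illustrative only)
assay_patients <- list(
  RNASeq2GeneNorm = c("P1", "P2", "P3", "P5"),
  gistict         = c("P1", "P2", "P3", "P4", "P5"),
  miRNASeqGene    = c("P2", "P3", "P5", "P6")
)

# Patients present in every assay: successive pairwise intersections
common <- Reduce(intersect, assay_patients)
common

# Equivalent "complete cases" view: one row per patient, one column per assay
all_patients <- sort(unique(unlist(assay_patients)))
has_assay <- sapply(assay_patients, function(p) all_patients %in% p)
rownames(has_assay) <- all_patients
complete <- rowSums(has_assay) == length(assay_patients)
summary(complete)
```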


#The column names of the assays in miniACC.assays.comp are not the same because of assay-specific identifiers,
#but they have been automatically re-arranged to correspond to the same patients. In these TCGA assays,
#the first three dash-delimited positions correspond to the patient, i.e. the first patient is TCGA-OR-A5J1:
colnames(miniACC.assays.comp)
## CharacterList of length 3
## [["RNASeq2GeneNorm"]] TCGA-OR-A5J1-01A-11R-A29S-07 ...
## [["gistict"]] TCGA-OR-A5J1-01A-11D-A29H-01 ... TCGA-P6-A5OG-01A-22D-A29H-01
## [["miRNASeqGene"]] TCGA-OR-A5J1-01A-11R-A29W-13 ...
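
#The patient ID can be recovered from a full barcode with base R string functions. A minimal sketch using three of the
#barcodes printed above; the object names are illustrative. In TCGA barcodes the fourth field starts with the
#sample-type code, where 01 denotes a primary solid tumour.

```r
barcodes <- c("TCGA-OR-A5J1-01A-11R-A29S-07",
              "TCGA-OR-A5J1-01A-11D-A29H-01",
              "TCGA-P6-A5OG-01A-22D-A29H-01")

# Split on dashes and keep the first three fields as the patient ID
fields <- strsplit(barcodes, "-", fixed = TRUE)
patient_id <- vapply(fields, function(p) paste(p[1:3], collapse = "-"), character(1))
patient_id

# First two characters of the fourth field give the sample-type code
sample_type <- substr(vapply(fields, `[`, character(1), 4), 1, 2)
all(sample_type == "01")   # TRUE when every sample is a primary tumour
```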
#intersectRows() keeps only rows that are common to each assay, and aligns them in identical order
#miniACC.assays.comp2 <- intersectRows(miniACC.assays.comp[, , c("RNASeq2GeneNorm","gistict","miRNASeqGene")])
rownames(miniACC.assays.comp)
## CharacterList of length 3
## [["RNASeq2GeneNorm"]] DIRAS3 MAPK14 YAP1 CDKN1B ... CHGA IDH3A SQSTM1 KCNJ13
## [["gistict"]] DIRAS3 MAPK14 YAP1 CDKN1B ERBB2 ... CHGA IDH3A SQSTM1 KCNJ13
## [["miRNASeqGene"]] hsa-let-7a-1 hsa-let-7a-2 ... hsa-mir-99a hsa-mir-99b
#Obtain age variable and study its frequency on the common samples. We will take variable years_to_birth
years_to_birth  <- colData(miniACC.assays.comp)$years_to_birth 
table(years_to_birth )
## years_to_birth
## 14 17 22 23 25 26 27 29 30 31 32 34 36 37 39 40 42 44 45 46 47 48 49 50 51 52 
##  1  2  2  3  2  2  1  1  3  1  1  1  3  3  2  1  1  2  2  1  1  2  1  1  1  3 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 71 75 77 
##  4  1  2  1  2  1  2  2  3  1  2  1  3  1  1  1  2  1  1  1
# plotting integer vector
barplot(table(years_to_birth), xlab = "Patient age (years)", ylab = "Count", col = "white", col.axis = "darkgreen", col.lab = "darkgreen")

hist(years_to_birth, main = "Histogram of Patient Age",xlab = "Values",col.lab = "darkgreen",col.main = "darkgreen") 

#Plot the histogram and overlay the density
hist(years_to_birth, freq = FALSE)
lines(density(years_to_birth))

#The distribution appears unimodal (not bi-modal) and roughly symmetric, so a normal approximation looks reasonable

#We use the fitdistrplus package, which provides tools for distribution fitting.
library(fitdistrplus)
descdist(years_to_birth, discrete = FALSE)

## summary statistics
## ------
## min:  14   max:  77 
## median:  49 
## mean:  46.64935 
## estimated sd:  15.94049 
## estimated skewness:  -0.2132373 
## estimated kurtosis:  2.004782
#Now we attempt to fit different distributions:
normal_dist <- fitdist(years_to_birth, "norm")
#and inspect the fit:
plot(normal_dist)

#For comparison, we also fit a binomial distribution (size fixed at 77):
binomial_dist <- fitdist(years_to_birth, "binom", fix.arg=list(size=77), start=list(prob=0.3))
#and inspect the fit:
plot(binomial_dist) 

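#We conclude that years_to_birth is reasonably approximated by a normal distribution
#(skewness -0.21; kurtosis 2.0, i.e. somewhat flatter than the normal value of 3)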
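
#A formal test can complement the visual assessment of the fit. The sketch below rebuilds the 77 ages from the frequency
#table printed earlier and applies the Shapiro-Wilk test from base R. The object names (ages, counts, years) are
#illustrative and this check was not part of the original analysis.

```r
# Ages and their frequencies, copied from the table(years_to_birth) output
ages <- c(14, 17, 22, 23, 25, 26, 27, 29, 30, 31, 32, 34, 36, 37, 39, 40, 42, 44,
          45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
          63, 64, 65, 66, 67, 68, 69, 71, 75, 77)
counts <- c(1, 2, 2, 3, 2, 2, 1, 1, 3, 1, 1, 1, 3, 3, 2, 1, 1, 2,
            2, 1, 1, 2, 1, 1, 1, 3, 4, 1, 2, 1, 2, 1, 2, 2, 3, 1,
            2, 1, 3, 1, 1, 1, 2, 1, 1, 1)
years <- rep(ages, times = counts)

# Sanity checks against the summary statistics reported in this section
length(years)   # number of patients
mean(years)     # should agree with the reported mean

# Shapiro-Wilk test: a small p-value would argue against normality
sw <- shapiro.test(years)
sw$p.value
```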

#The mean and SD are appropriate if the variable is somewhat symmetric. However, they can be misleading
#if the data are skewed (non-symmetric distribution) or there are outliers.
#The median and IQR can be used with any variable, but are typically used as an alternative to the mean 
#and SD when the variable is skewed (not symmetric) or there are outliers since they are robust to skew and outliers.
#“Outliers” are values that are far away from the bulk of the values.

#Using the following functions to compute these statistics and study the continuous variable :

is.na (years_to_birth)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE
sum(is.na(years_to_birth)) # Number of missing values
## [1] 0
mean(years_to_birth, na.rm = T)
## [1] 46.64935
sd(years_to_birth, na.rm = T)
## [1] 15.94049
min(years_to_birth, na.rm = T)
## [1] 14
max(years_to_birth, na.rm = T)
## [1] 77
median(years_to_birth, na.rm = T)
## [1] 49
IQR(years_to_birth, na.rm = T)
## [1] 26
quantile(years_to_birth, probs = c(0,0.25,0.5,0.75,1))
##   0%  25%  50%  75% 100% 
##   14   34   49   60   77
#df %>%
#  group_by(n < 0) %>%
#  top_n(2, abs(n)) %>%
# ungroup()

length(years_to_birth)
## [1] 77
#Extracting the lowest 5 ages and the highest 5 ages (lower and upper tails of the distribution). Evaluating young patient ages in distribution
sort(years_to_birth)[1:5]
## [1] 14 17 17 22 22
#The 5 youngest patients span 3 distinct ages (14, 17, 22) and, per the frequency table, no other patient shares these ages. Therefore:
young<-c(sort(years_to_birth)[1:5]) 
young
## [1] 14 17 17 22 22
#Evaluating old patient ages in distribution (the 5 highest ages, in decreasing order)
old<-sort(years_to_birth, decreasing = TRUE)[1:5]
old
## [1] 77 75 71 69 69
#Now subset multi-assay experiment to only include those corresponding patients with selected age
combined.age<-c(young, old)
combined.age
##  [1] 14 17 17 22 22 77 75 71 69 69
#Subsetting according to age of young and old patients
#multiassayexperiment[i = rownames, j = primary or colnames, k = assay]
selected.age <- miniACC.assays.comp$years_to_birth %in% combined.age
selected.age
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [61] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE
miniACC.assays.comp.age<-miniACC.assays.comp[, miniACC.assays.comp$years_to_birth %in% combined.age , ]
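
#Matching on age values selects exactly 10 patients here only because no other patient shares a boundary age.
#A selection by rank is robust to such ties. The sketch below uses a made-up phenotype table (pheno) to show the
#difference; the names pheno, young_ids and old_ids are illustrative.

```r
# Hypothetical phenotype table with a tie at the boundary age 22
pheno <- data.frame(
  patientID = paste0("P", 1:12),
  years_to_birth = c(14, 17, 17, 22, 22, 22, 50, 69, 69, 71, 75, 77)
)
n_tail <- 5

# Rank-based selection: always returns exactly n_tail patients per group
ord <- order(pheno$years_to_birth)
young_ids <- pheno$patientID[head(ord, n_tail)]
old_ids   <- pheno$patientID[tail(ord, n_tail)]

# Value-based selection: picks up every patient sharing one of the tail ages
tail_ages <- c(sort(pheno$years_to_birth)[1:n_tail],
               sort(pheno$years_to_birth, decreasing = TRUE)[1:n_tail])
by_value <- pheno$patientID[pheno$years_to_birth %in% tail_ages]

length(c(young_ids, old_ids))   # 10
length(by_value)                # 11, because three patients are aged 22
```

#With patient IDs in hand, the MultiAssayExperiment could then be subset by those IDs in the column position,
#assuming they match the rownames of colData.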

#Remove NA values from vector
#miniACC.comp.age.na<-miniACC.comp.age[, !is.na(miniACC.comp.age$years_to_birth %in% combined.age), ]

#Obtain common samples
#miniACC.sub.compmatch.age.na <- miniACC.sub.compmatch.age[, complete.cases(miniACC.sub.compmatch.age), ]
#miniACC.sub.compmatch.age

colData(miniACC.assays.comp.age)$patientID
##  [1] "TCGA-OR-A5J9" "TCGA-OR-A5JE" "TCGA-OR-A5JF" "TCGA-OR-A5JI" "TCGA-OR-A5K0"
##  [6] "TCGA-OR-A5KV" "TCGA-OR-A5L5" "TCGA-OR-A5LC" "TCGA-OR-A5LE" "TCGA-OR-A5LL"
#head(str(miniACC.assays.comp.age))
#Confirm dimensions and that correct indexes were used in extraction:
experiments(miniACC.assays.comp.age)
## ExperimentList class object of length 3:
##  [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 10 columns
##  [2] gistict: SummarizedExperiment with 198 rows and 10 columns
##  [3] miRNASeqGene: SummarizedExperiment with 471 rows and 10 columns
sampleMap(miniACC.assays.comp.age)
## DataFrame with 30 rows and 3 columns
##               assay      primary                colname
##            <factor>  <character>            <character>
## 1   RNASeq2GeneNorm TCGA-OR-A5J9 TCGA-OR-A5J9-01A-11R..
## 2   RNASeq2GeneNorm TCGA-OR-A5JE TCGA-OR-A5JE-01A-11R..
## 3   RNASeq2GeneNorm TCGA-OR-A5JF TCGA-OR-A5JF-01A-11R..
## 4   RNASeq2GeneNorm TCGA-OR-A5JI TCGA-OR-A5JI-01A-11R..
## 5   RNASeq2GeneNorm TCGA-OR-A5K0 TCGA-OR-A5K0-01A-11R..
## ...             ...          ...                    ...
## 26     miRNASeqGene TCGA-OR-A5KV TCGA-OR-A5KV-01A-11R..
## 27     miRNASeqGene TCGA-OR-A5L5 TCGA-OR-A5L5-01A-11R..
## 28     miRNASeqGene TCGA-OR-A5LC TCGA-OR-A5LC-01A-11R..
## 29     miRNASeqGene TCGA-OR-A5LE TCGA-OR-A5LE-01A-11R..
## 30     miRNASeqGene TCGA-OR-A5LL TCGA-OR-A5LL-01A-11R..
metadata(miniACC.assays.comp.age)
## $title
## [1] "Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma"
## 
## $PMID
## [1] "27165744"
## 
## $sourceURL
## [1] "http://s3.amazonaws.com/multiassayexperiments/accMAEO.rds"
## 
## $RPPAfeatureDataURL
## [1] "http://genomeportal.stanford.edu/pan-tcga/show_target_selection_file?filename=Allprotein.txt"
## 
## $colDataExtrasURL
## [1] "http://www.cell.com/cms/attachment/2062093088/2063584534/mmc3.xlsx"
#Subset each omics data block (study object class and data type). We extract each complete SummarizedExperiment of interest for separate, 
#individual evaluation and for determining whether samples are aligned
mACC.exp3 <- miniACC.assays.comp.age[[1]] #SummarizedExperiment
mACC.CN3 <- miniACC.assays.comp.age[[2]] #SummarizedExperiment
mACC.mir3 <- miniACC.assays.comp.age[[3]] #SummarizedExperiment

#data types
range(assay(mACC.exp3))
## [1]      0.0 206162.3
table(assay(mACC.CN3))
## 
##   -2   -1    0    1    2 
##    3  336 1066  565   10
range(assay(mACC.mir3))
## [1]       0 2753979
rowData(mACC.exp3)
## DataFrame with 198 rows and 0 columns
metadata(mACC.exp3)
## $experimentData
## Experiment data
##   Experimenter name:  
##   Laboratory:  
##   Contact information:  
##   Title:  
##   URL:  
##   PMIDs:  
##   No abstract available.
## 
## $annotation
## character(0)
## 
## $protocolData
## An object of class 'AnnotatedDataFrame': none
#Need to make sure that we have the same SAMPLES
s.exp3 <- substr(colnames(mACC.exp3),1,15)
s.CN3 <- substr(colnames(mACC.CN3),1,15)
s.mir3 <- substr(colnames(mACC.mir3),1,15)

s.common3 <- intersect(intersect(s.exp3,s.CN3),s.mir3)

TCGAutils::sampleTables(miniACC.assays.comp.age)
## $RNASeq2GeneNorm
## 
## 01 
## 10 
## 
## $gistict
## 
## 01 
## 10 
## 
## $miRNASeqGene
## 
## 01 
## 10
data(sampleTypes, package="TCGAutils")
sampleTypes
##    Code                                        Definition Short.Letter.Code
## 1    01                               Primary Solid Tumor                TP
## 2    02                             Recurrent Solid Tumor                TR
## 3    03   Primary Blood Derived Cancer - Peripheral Blood                TB
## 4    04      Recurrent Blood Derived Cancer - Bone Marrow              TRBM
## 5    05                          Additional - New Primary               TAP
## 6    06                                        Metastatic                TM
## 7    07                             Additional Metastatic               TAM
## 8    08                        Human Tumor Original Cells              THOC
## 9    09        Primary Blood Derived Cancer - Bone Marrow               TBM
## 10   10                              Blood Derived Normal                NB
## 11   11                               Solid Tissue Normal                NT
## 12   12                                Buccal Cell Normal               NBC
## 13   13                           EBV Immortalized Normal              NEBV
## 14   14                                Bone Marrow Normal               NBM
## 15   15                                    sample type 15              15SH
## 16   16                                    sample type 16              16SH
## 17   20                                   Control Analyte             CELLC
## 18   40 Recurrent Blood Derived Cancer - Peripheral Blood               TRB
## 19   50                                        Cell Lines              CELL
## 20   60                          Primary Xenograft Tissue                XP
## 21   61                Cell Line Derived Xenograft Tissue               XCL
## 22   99                                    sample type 99              99SH
#Sample type code 01 = Primary Solid Tumor (TP)

#All samples are tumoral (TP). Extract each assay matrix and keep only the columns of the common samples
mACC.exp.m3 <- assay(mACC.exp3)
mACC.exp.c3 <- mACC.exp.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.exp.m3),value = TRUE)]

mACC.CN.m3 <- assay(mACC.CN3)
mACC.CN.c3 <- mACC.CN.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.CN.m3),value = TRUE)]

mACC.mir.m3 <- assay(mACC.mir3)
mACC.mir.c3 <- mACC.mir.m3[,grep(paste(s.common3,collapse="|"),colnames(mACC.mir.m3),value = TRUE)]

#check order and years_to_birth variable
cd3 <- colData(miniACC.assays.comp.age)

all.equal(rownames(cd3),substr(colnames(mACC.exp.c3),1,12))
## [1] TRUE
all.equal(rownames(cd3),substr(colnames(mACC.CN.c3),1,12))
## [1] TRUE
all.equal(rownames(cd3),substr(colnames(mACC.mir.c3),1,12))
## [1] TRUE
#ALL TRUE 
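
#The checks above rely on fixed positions inside the TCGA barcode: characters 1-12 identify the patient,
#characters 1-15 the patient plus sample type, and characters 14-15 the sample type code.
#A minimal sketch using barcodes that appear in the column names printed in this report; the object names
#(bc_rna, bc_cn, idx, ...) are illustrative assumptions, and match() is shown as an exact-matching
#alternative to the grep() pattern, not as the method used in this analysis.

```r
# Barcodes taken from the column names printed in this report
bc_rna <- c("TCGA-OR-A5J9-01A-11R-A29S-07", "TCGA-OR-A5JE-01A-11R-A29S-07")
bc_cn  <- c("TCGA-OR-A5J9-01A-11D-A29H-01", "TCGA-OR-A5JE-01A-11D-A29H-01")

patient     <- substr(bc_rna, 1, 12)   # patient, e.g. "TCGA-OR-A5J9"
sample_id   <- substr(bc_rna, 1, 15)   # patient + sample type, e.g. "TCGA-OR-A5J9-01"
sample_type <- substr(bc_rna, 14, 15)  # "01" = Primary Solid Tumor

# Exact matching on the 15-character prefix instead of a regular expression
idx <- match(sample_id, substr(bc_cn, 1, 15))
aligned_cn <- bc_cn[idx]               # CN barcodes reordered to follow the RNA order
```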

# GLOBAL MFA variables
exp.l3<-nrow(mACC.exp.c3)
cn.l3<-nrow(mACC.CN.c3)
mir.l3<-nrow(mACC.mir.c3)

#Convert integer vector into factor with 2 levels (old, young) based on condition
colData(miniACC.assays.comp.age)$years_to_birth <- factor(ifelse(colData(miniACC.assays.comp.age)$years_to_birth>=68, "old", "young"))
table(colData(miniACC.assays.comp.age)$years_to_birth) 
## 
##   old young 
##     5     5
cond2<-colData(miniACC.assays.comp.age)$years_to_birth
cond2
##  [1] young young old   young old   young old   old   young old  
## Levels: old young
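
#The conversion above overwrites the numeric years_to_birth with the age group, and the factor levels are
#ordered alphabetically (old, young). A minimal sketch on hypothetical toy ages, assuming one would rather
#keep the numeric ages and make "young" the first (reference) level; toy_age and toy_status are illustrative names.

```r
# Hypothetical toy ages (not the miniACC values)
toy_age <- c(14, 77, 22, 69, 17)

# Same threshold as above, but with explicit level order and a separate variable
toy_status <- factor(ifelse(toy_age >= 68, "old", "young"),
                     levels = c("young", "old"))

table(toy_status)
```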
#The patient ID and sample order will be confirmed later

##############################################################################################################################################
#To later run the CNVRanger eQTL function (cnvEQTL) for the joint mRNA/CNV analysis, we need the initial experiment of individual CNV calls, 
#which is the input that gets processed into the gene-based GISTIC CNV peak experiment already included in miniACC. We therefore obtain this individual CNV call 
#experiment from TCGA and add it to the original MultiAssayExperiment object as follows:

miniACC.assays.comp.age.cnvcalls<-miniACC.assays.comp.age

cnv <- curatedTCGAData(diseaseCode = "ACC",assays = c("*CNV*"), version="1.1.38",dry.run = FALSE)
## Querying and downloading: ACC_CNVSNP-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_colData-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_metadata-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## Querying and downloading: ACC_sampleMap-20160128
## see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
## loading from cache
## harmonizing input:
##   removing 825 sampleMap rows not in names(experiments)
test<-cnv[[1]]

#The c function allows the user to concatenate an additional experiment to an existing MultiAssayExperiment. 
#The optional sampleMap argument allows concatenating an assay whose column names do not match the row names of colData. 
#For convenience, the mapFrom argument (e.g. mapFrom=1L) allows the user to map from a particular experiment, provided that the order of the colnames is the same. 
#A warning will be issued to make the user aware of this assumption.

miniACC.assays.comp.age.cnvcalls<-c(miniACC.assays.comp.age.cnvcalls, newassay=cnv)
## Warning in `[<-.factor`(`*tmp*`, ri, value = c(58L, 44L, 23L, 23L, 30L, :
## invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, ri, value = c(58L, 44L, 23L, 23L, 30L, :
## invalid factor level, NA generated
#To annotate the genomic coordinates of the genes measured in the RNA-seq assay, we use the function symbolsToRanges from the TCGAutils package. 
#In the cases where row annotations indicate gene symbols, the symbolsToRanges utility function converts genes to genomic ranges and replaces existing assays 
#with RangedSummarizedExperiment objects. Gene annotations are given as 'hg19' genomic regions.
#Name of the genome is typically the name of an NCBI assembly (e.g. GRCh38.p13, WBcel235, TAIR10.1, ARS-UCD1.2, etc...) 
#or UCSC genome(e.g. hg38, bosTau9, galGal6, ce11, etc...)

miniACC.assays.comp.age.cnvcalls.ranges <- TCGAutils::symbolsToRanges(miniACC.assays.comp.age.cnvcalls, unmapped=FALSE)
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
##   removing 20 sampleMap rows not in names(experiments)
#microRNA assays obtained from curatedTCGAData have annotated sequences that can be converted to genomic ranges using the mirbase.db package. 
#The function looks up all sequences and converts them to ('hg19') ranges. For those rows that cannot be found, an 'unranged' assay is introduced in the resulting MultiAssayExperiment object.
miniACC.assays.comp.age.cnvcalls.ranges  <- mirToRanges(miniACC.assays.comp.age.cnvcalls.ranges)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## harmonizing input:
##   removing 10 sampleMap rows not in names(experiments)
#for(i in 1:4) 
#{
#  rr <- rowRanges(miniACC.assays.comp.age.cnvcalls.ranges[[i]])
#  GenomeInfoDb::genome(rr) <- "hg19"
#  GenomeInfoDb::seqlevelsStyle(rr) <- "UCSC"
#  ind <- as.character(seqnames(rr)) %in% c("chr1","chr2","chr3", "chr4","chr5", "chr6","chr7", "chr8", "chr9","chr10","chr11", "chr12","chr13", "chr14","chr15", "chr16", "chr17", "chr18","chr19","chr20", "chr21","chr22", "chr23", "chrx")
#  rowRanges(miniACC.assays.comp.age.cnvcalls.ranges[[i]]) <- rr
#  miniACC.assays.comp.age.cnvcalls.ranges[[i]] <- miniACC.assays.comp.age.cnvcalls.ranges[[i]][ind,]
#}
#miniACC.assays.comp.age.cnvcalls.ranges

#We now restrict the analysis to intersecting patients of the three assays using MultiAssayExperiment’s intersectColumns function, 
#and select Primary Solid Tumor samples using the splitAssays function from the TCGAutils package.
#miniACC.assays.comp.age.cnvcalls <- MultiAssayExperiment::intersectColumns(miniACC.assays.comp.age.cnvcalls)
#miniACC.assays.comp.age.cnvcalls<-miniACC.assays.comp.age.cnvcalls[, miniACC.assays.comp.age.cnvcalls$patientID %in% 
#c("TCGA-OR-A5J9", "TCGA-OR-A5JE", "TCGA-OR-A5JF", "TCGA-OR-A5JI", "TCGA-OR-A5K0" ,"TCGA-OR-A5KV", "TCGA-OR-A5L5", "TCGA-OR-A5LC", "TCGA-OR-A5LE","TCGA-OR-A5LL" )  , ]
#miniACC.assays.comp.age.cnvcalls.ranges <- splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, sampleCodes="01")
#Error: 'splitAssays' is not an exported object from 'namespace:TCGAutils'
#miniACC.assays.comp.age.cnvcalls.ranges <- splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, c("01"))
#Error in splitAssays(miniACC.assays.comp.age.cnvcalls.ranges, c("01")) : 
#is.list(hitList) || is(hitList, "List") is not TRUE 

#Extracting individual summarized experiments which will be henceforth individually analyzed:
cnv_calls<-miniACC.assays.comp.age.cnvcalls.ranges[[1]]
cnv_calls
## class: RaggedExperiment 
## dim: 21052 180 
## assays(2): Num_Probes Segment_Mean
## rownames: NULL
## colnames(180): TCGA-OR-A5J1-01A-11D-A29H-01
##   TCGA-OR-A5J1-10A-01D-A29K-01 ... TCGA-PK-A5HC-01A-11D-A309-01
##   TCGA-PK-A5HC-11A-11D-A309-01
## colData names(0):
#head(assays(cnv_calls)$Num_Probes)
#head(assays(cnv_calls)$Segment_Mean)
 
mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
mRNA_expr
## class: RangedSummarizedExperiment 
## dim: 195 10 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(195): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(1): gene_id
## colnames(10): TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07
##   ... TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07
## colData names(0):
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gistic
## class: RangedSummarizedExperiment 
## dim: 195 10 
## metadata(0):
## assays(1): ''
## rownames(195): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(4): Gene.Symbol Locus.ID Cytoband gene_id
## colnames(10): TCGA-OR-A5J9-01A-11D-A29H-01 TCGA-OR-A5JE-01A-11D-A29H-01
##   ... TCGA-OR-A5LE-01A-11D-A29H-01 TCGA-OR-A5LL-01A-11D-A29H-01
## colData names(0):
miRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[4]]
miRNA_expr
## class: RangedSummarizedExperiment 
## dim: 448 10 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(448): hsa-let-7a-1 hsa-let-7a-2 ... hsa-mir-99a hsa-mir-99b
## rowData names(1): mirna_id
## colnames(10): TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
##   ... TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## colData names(0):
miRNA_expr_unranged<-miniACC.assays.comp.age.cnvcalls.ranges[[5]]
miRNA_expr_unranged
## class: SummarizedExperiment 
## dim: 23 10 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(23): hsa-mir-103-1 hsa-mir-103-2 ... hsa-mir-663 hsa-mir-664
## rowData names(0):
## colnames(10): TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
##   ... TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## colData names(0):
#From here on we analyze the individual SummarizedExperiments extracted from the MultiAssayExperiment miniACC.assays.comp.age, that is, the objects obtained
#(1) prior to inclusion of the additional experiment of individual CNV calls, because we were unsuccessful in equalizing the samples and patients 
#across all experiments once the individual calls experiment was included, and 
#(2) prior to execution of the mirToRanges function, because it split the miRNA SummarizedExperiment into ranged and unranged experiments.

mRNA-Seq DATA BLOCK ANALYSIS

#Preliminary analysis of individual extracted mRNA-seq Summarized Experiment:

#Creating a phenotype dataframe for mRNA expression:
phenoN <- data.frame(sample=colnames(mACC.exp.c3),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN)<-phenoN$sample 

countsM <- as.matrix(assays(mACC.exp3)$exprs)

#countsM is identical to the matrix mACC.exp.m3 extracted above (same assay, "exprs")
#The gene IDs appear to be HGNC symbols. For instance, DIRAS3 is the HGNC symbol for DIRAS family GTPase 3 in Homo sapiens: https://www.ncbi.nlm.nih.gov/gene/9077

sum(is.na(countsM))
## [1] 0
#As part of the exploration, we plot data
boxplot(countsM) #The normalized values were not log2-transformed, so the distributions are strongly right-skewed

boxplot(log2(countsM+2))

#Check Library size
lSize <- colSums(countsM)
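#All sample sums are below 1M and non-homogeneous. A total of exactly 1M per sample would only be expected
#after per-million scaling (CPM or TPM), not after TMM, and this assay holds just 198 genes.
lSize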
## TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07 
##                     533661.8                     698097.1 
## TCGA-OR-A5JF-01A-11R-A29S-07 TCGA-OR-A5JI-01A-11R-A29S-07 
##                     555528.0                     648939.3 
## TCGA-OR-A5K0-01A-11R-A29S-07 TCGA-OR-A5KV-01A-11R-A29S-07 
##                     562191.3                     727214.2 
## TCGA-OR-A5L5-01A-11R-A29S-07 TCGA-OR-A5LC-01A-11R-A29S-07 
##                     642192.1                     560695.8 
## TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07 
##                     592894.2                     511599.7
#We study the total reads per sample (library size), expressed in millions.
sampleT <- apply(countsM, 2, sum)/10^6
range(sampleT)
## [1] 0.5115997 0.7272142
sampleTDF <- data.frame(sample=names(sampleT), total=sampleT)

p <- ggplot(sampleTDF, aes(x=sample, y=total, fill=total)) + geom_bar(stat="identity") #use the data frame column, not the global vector
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("")

#One of the characteristics of RNA-seq data is that it contains a lot of zeros, 
#corresponding to genes that are not expressed. It is therefore important to remove genes that 
#consistently have zero or very low counts. In this case we only keep genes that have more than 
#10 reads in at least 5 samples. One recommendation is to set the number of samples
#to the smallest group size. Our "old" group and "young" group have 5 samples each (10 patients, 10 samples total).

keep <- rowSums(countsM > 10) >= 5 # genes with more than 10 reads in at least 5 samples
countsF <- countsM[keep,]
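
#How the filter behaves can be seen on a small example. This is a sketch on a hypothetical toy matrix
#(4 genes x 10 samples, not miniACC data); toy, min_count, min_samples, n_pass and keep_toy are illustrative names.

```r
# Hypothetical toy count matrix: 4 genes x 10 samples (not miniACC data)
toy <- rbind(
  geneA = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  geneB = c(50, 60, 0, 0, 0, 0, 0, 0, 0, 0),
  geneC = c(11, 12, 13, 14, 15, 0, 0, 0, 0, 0),
  geneD = c(100, 90, 80, 70, 60, 50, 40, 30, 20, 10)
)
min_count   <- 10                      # a sample passes if its count is strictly above this
min_samples <- 5                       # smallest group size

n_pass   <- rowSums(toy > min_count)   # number of passing samples per gene
keep_toy <- n_pass >= min_samples      # genes retained by the filter
toy[keep_toy, , drop = FALSE]
```

#geneB is removed although it is highly expressed in two samples, and a count of exactly 10 does not pass
#because the comparison is strict.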

#There are several methods that can be used to normalize values in count matrices. 
#Traditionally, CPM (Counts Per Million), RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million) 
#were used to report RNA-seq results. However, TPM (Transcripts Per Million) is now more popular. 
#CPM divides the counts by library size, whereas RPKM/FPKM and TPM scale the data using both gene length and library size. 
#For comparisons between samples, TMM (Trimmed Mean of M-values) is a widely used normalization method. Other methods 
#also include the GC content in the normalization step.

#Gene length
#To normalize using RPKM, FPKM or TPM we need the gene length. 
#Let’s obtain this information through biomaRt. 
#The Cancer Genome Atlas (TCGA) uses GENCODE 36 (GRCh38/hg38) as a reference gene model.
#The GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation.
#GENCODE version 36 corresponds to Ensembl 102 according to: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=wgEncodeGencodeSuper
#We therefore take version 102 from the available archives and extract the HGNC symbol, chromosome, start and end positions to compute the gene length later.

listMarts()
##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 112
## 2   ENSEMBL_MART_MOUSE      Mouse strains 112
## 3     ENSEMBL_MART_SNP  Ensembl Variation 112
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112
listEnsemblArchives() 
##              name     date                                 url version
## 1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
## 2     Ensembl 112 May 2024 https://may2024.archive.ensembl.org     112
## 3     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111
## 4     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
## 5     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
## 6     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
## 7     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
## 8     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
## 9     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
## 10    Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
## 11    Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
## 12    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
## 13    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
## 14    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
## 15     Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org      99
## 16     Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org      98
## 17     Ensembl 97 Jul 2019 https://jul2019.archive.ensembl.org      97
## 18     Ensembl 80 May 2015 https://may2015.archive.ensembl.org      80
## 19     Ensembl 77 Oct 2014 https://oct2014.archive.ensembl.org      77
## 20     Ensembl 75 Feb 2014 https://feb2014.archive.ensembl.org      75
## 21     Ensembl 54 May 2009 https://may2009.archive.ensembl.org      54
##    current_release
## 1                 
## 2                *
## 3                 
## 4                 
## 5                 
## 6                 
## 7                 
## 8                 
## 9                 
## 10                
## 11                
## 12                
## 13                
## 14                
## 15                
## 16                
## 17                
## 18                
## 19                
## 20                
## 21
#Taking version 102
listEnsembl(version = 102)
##         biomart                version
## 1         genes      Ensembl Genes 102
## 2 mouse_strains      Mouse strains 102
## 3          snps  Ensembl Variation 102
## 4    regulation Ensembl Regulation 102
ensembl102 <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl',version = 102)

#listDatasets(ensembl102)
attributes = listAttributes(ensembl102)
attributes[1:5,]
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
#searchAttributes(mart = ensembl102, pattern = "hgnc_symbol")
#searchAttributes(mart = ensembl102, pattern = "position")
#searchAttributes(mart = ensembl102, pattern = "length")
#searchAttributes(mart = ensembl102, pattern = "ensembl.*id")
searchAttributes(mart = ensembl102, pattern = "entrez.*id")
##             name                        description         page
## 79 entrezgene_id NCBI gene (formerly Entrezgene) ID feature_page
filters = listFilters(ensembl102)
filters[1:5,] 
##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
searchFilters(mart = ensembl102, pattern = "hgnc_symbol")
##           name                description
## 81 hgnc_symbol HGNC symbol(s) [e.g. A1BG]
head(searchFilters(mart = ensembl102, pattern = "ensembl.*id"))
##                             name
## 56               ensembl_gene_id
## 57       ensembl_gene_id_version
## 58         ensembl_transcript_id
## 59 ensembl_transcript_id_version
## 60            ensembl_peptide_id
## 61    ensembl_peptide_id_version
##                                                       description
## 56                       Gene stable ID(s) [e.g. ENSG00000000003]
## 57       Gene stable ID(s) with version [e.g. ENSG00000000003.15]
## 58                 Transcript stable ID(s) [e.g. ENST00000000233]
## 59 Transcript stable ID(s) with version [e.g. ENST00000000233.10]
## 60                    Protein stable ID(s) [e.g. ENSP00000000233]
## 61     Protein stable ID(s) with version [e.g. ENSP00000000233.5]
gensInfo<-getBM(attributes=c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_position","end_position","entrezgene_id","hgnc_symbol","description" ), filters=c("hgnc_symbol"), values=list(rownames(countsF)), mart=ensembl102)
#Gene length is approximated here by the genomic span (end - start), which includes introns
gensInfo$length <- gensInfo$end_position - gensInfo$start_position
range(gensInfo$length)
## [1]   2403 824272
dim(gensInfo) #the row count differs from the number of genes queried: some symbols are repeated and some are missing
## [1] 195   9
table(duplicated(gensInfo$hgnc_symbol)) #some symbols are duplicated
## 
## FALSE  TRUE 
##   181    14
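gensInfo[duplicated(gensInfo$hgnc_symbol),] #duplicated symbols map to several Ensembl gene IDs, some of them on alternative haplotype or patch contigs (e.g. HSPA1A, MAPT)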
##     hgnc_symbol ensembl_gene_id         chromosome_name start_position
## 2         ACACA ENSG00000278540                      17       37084992
## 10         AKT3 ENSG00000117020                       1      243488233
## 48        CLDN7 ENSG00000181885                      17        7259903
## 58        EEF2K ENSG00000103319                      16       22206278
## 80       HSPA1A ENSG00000234475 CHR_HSCHR6_MHC_DBB_CTG1       31797650
## 81       HSPA1A ENSG00000237724 CHR_HSCHR6_MHC_COX_CTG1       31802834
## 82       HSPA1A ENSG00000215328 CHR_HSCHR6_MHC_QBL_CTG1       31805699
## 83       HSPA1A ENSG00000204389                       6       31815543
## 103        MAPT ENSG00000276155      CHR_HSCHR17_1_CTG5       46069784
## 104        MAPT ENSG00000186868                      17       45894551
## 111       MYH11 ENSG00000133392                      16       15703135
## 141        PTEN ENSG00000171862                      10       87863625
## 156     RPS6KA1 ENSG00000117676                       1       26529761
## 194       YWHAE ENSG00000108953                      17        1344275
##     end_position entrezgene_id hgnc_symbol.1
## 2       37406836            31         ACACA
## 10     243851079         10000          AKT3
## 48       7263983          1366         CLDN7
## 58      22288738         29904         EEF2K
## 80      31800132          3303        HSPA1A
## 81      31805316          3303        HSPA1A
## 82      31808181          3303        HSPA1A
## 83      31817946          3303        HSPA1A
## 103     46203150          4137          MAPT
## 104     46028334          4137          MAPT
## 111     15857028          4629         MYH11
## 141     87971930          5728          PTEN
## 156     26575030          6195       RPS6KA1
## 194      1400222          7531         YWHAE
##                                                                                                            description
## 2                                                        acetyl-CoA carboxylase alpha [Source:HGNC Symbol;Acc:HGNC:84]
## 10                                                     AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
## 48                                                                        claudin 7 [Source:HGNC Symbol;Acc:HGNC:2049]
## 58                                           eukaryotic elongation factor 2 kinase [Source:HGNC Symbol;Acc:HGNC:24615]
## 80                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 81                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 82                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 83                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 103                                              microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 104                                              microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 111                                                           myosin heavy chain 11 [Source:HGNC Symbol;Acc:HGNC:7569]
## 141                                                  phosphatase and tensin homolog [Source:HGNC Symbol;Acc:HGNC:9588]
## 156                                                 ribosomal protein S6 kinase A1 [Source:HGNC Symbol;Acc:HGNC:10430]
## 194 tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein epsilon [Source:HGNC Symbol;Acc:HGNC:12851]
##     length
## 2   321844
## 10  362846
## 48    4080
## 58   82460
## 80    2482
## 81    2482
## 82    2482
## 83    2403
## 103 133366
## 104 133783
## 111 153893
## 141 108305
## 156  45269
## 194  55947
length(setdiff(rownames(countsF), gensInfo$hgnc_symbol)) 
## [1] 1
countsFDF <- data.frame(ID=rownames(countsF),countsF)
countsFInfo <- right_join(countsFDF, gensInfo, by=c("ID"="hgnc_symbol")) 

countsFInfo <- countsFInfo[!duplicated(countsFInfo$ID),] #After having checked duplications, just keep first result

countsFInfo_backup<-countsFInfo
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'chromosome_name'] <- 'chr'
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'start_position'] <- 'start'
colnames(countsFInfo_backup)[colnames(countsFInfo_backup) == 'end_position'] <- 'end'

#Chromosome names that are missing or erroneous need to be fixed:
countsFInfo_backup[countsFInfo_backup$ID == "RPS6KA1", "chr"] <- "1"
countsFInfo_backup[countsFInfo_backup$ID == "AKT3", "chr"] <- "1"
countsFInfo_backup[countsFInfo_backup$ID == "CLDN7", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "PTEN", "chr"] <- "10"
countsFInfo_backup[countsFInfo_backup$ID == "YWHAE", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "MAPT", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "ACACA", "chr"] <- "17"
countsFInfo_backup[countsFInfo_backup$ID == "EEF2K", "chr"] <- "16"
countsFInfo_backup[countsFInfo_backup$ID == "MYH11", "chr"] <- "16"
countsFInfo_backup[countsFInfo_backup$ID == "HSPA1A", "chr"] <- "6"
countsFInfo_backup[countsFInfo_backup$ID == "CHGA", "chr"] <- "14"

countsFInfo_backup$chr<-paste0("chr", countsFInfo_backup$chr )
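
#Keeping the first hit of each duplicated symbol can retain a row located on an alternative contig, which is
#why chromosome names had to be corrected by hand above. The sketch below shows one possible alternative,
#assuming one prefers rows on primary chromosomes when deduplicating. It uses four rows copied from the
#duplicated-symbol output above; toy_info, primary and toy_dedup are illustrative names.

```r
# Rows copied from the biomaRt output above (MAPT and HSPA1A)
toy_info <- data.frame(
  hgnc_symbol     = c("MAPT", "MAPT", "HSPA1A", "HSPA1A"),
  chromosome_name = c("CHR_HSCHR17_1_CTG5", "17", "CHR_HSCHR6_MHC_COX_CTG1", "6"),
  start_position  = c(46069784, 45894551, 31802834, 31815543),
  end_position    = c(46203150, 46028334, 31805316, 31817946),
  stringsAsFactors = FALSE
)

# TRUE for rows on a primary chromosome, FALSE for alternative/patch contigs
primary <- toy_info$chromosome_name %in% c(1:22, "X", "Y", "MT")

# Put primary-chromosome rows first, then keep the first row per symbol
toy_info  <- toy_info[order(!primary), ]
toy_dedup <- toy_info[!duplicated(toy_info$hgnc_symbol), ]
toy_dedup
```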

#To perform FPKM (for paired-end reads) or RPKM (for single-end reads), we first divide by the library size and then by gene length. 
#The sum of each sample after FPKM normalization is different.

#step 1: normalize for read depth and multiply by million
readD <- apply(countsFInfo[,2:11], 2, function(x) x / sum(x) * 10^6) 

#step 2. scale by gene length and multiply by thousand
countsFPKM <- readD / countsFInfo$length * 10^3
colSums(countsFPKM)
## TCGA.OR.A5J9.01A.11R.A29S.07 TCGA.OR.A5JE.01A.11R.A29S.07 
##                     95486.10                    134318.90 
## TCGA.OR.A5JF.01A.11R.A29S.07 TCGA.OR.A5JI.01A.11R.A29S.07 
##                    101615.03                    123874.63 
## TCGA.OR.A5K0.01A.11R.A29S.07 TCGA.OR.A5KV.01A.11R.A29S.07 
##                    111547.13                    131227.77 
## TCGA.OR.A5L5.01A.11R.A29S.07 TCGA.OR.A5LC.01A.11R.A29S.07 
##                    118024.86                    117143.21 
## TCGA.OR.A5LE.01A.11R.A29S.07 TCGA.OR.A5LL.01A.11R.A29S.07 
##                    113831.66                     74559.88
#To perform TPM, we first divide by the gene length and then divide by the per-sample sum of these length-scaled values (the transformed sequencing depth). 
#Check that the sum of each column after TPM normalization equals 10^6.

# sampleTF <- colSums(countsFInfo[,2:11]) 

#step 1: divide by gene length and multiply by thousand to obtain the reads per kilobase (RPK) 
rpk <- countsFInfo[,2:11] / countsFInfo$length * 10^3
#step 2: divide by sequencing depth and multiply by million
countsTPM <- apply(rpk, 2, function(x) x / sum(x) * 10^6)
#check totals (All equal to 1 million)
colSums(countsTPM)
## TCGA.OR.A5J9.01A.11R.A29S.07 TCGA.OR.A5JE.01A.11R.A29S.07 
##                        1e+06                        1e+06 
## TCGA.OR.A5JF.01A.11R.A29S.07 TCGA.OR.A5JI.01A.11R.A29S.07 
##                        1e+06                        1e+06 
## TCGA.OR.A5K0.01A.11R.A29S.07 TCGA.OR.A5KV.01A.11R.A29S.07 
##                        1e+06                        1e+06 
## TCGA.OR.A5L5.01A.11R.A29S.07 TCGA.OR.A5LC.01A.11R.A29S.07 
##                        1e+06                        1e+06 
## TCGA.OR.A5LE.01A.11R.A29S.07 TCGA.OR.A5LL.01A.11R.A29S.07 
##                        1e+06                        1e+06
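#A minimal sketch on toy data (all counts and gene lengths below are hypothetical, not taken from miniACC)
#of why the order of the two scaling steps matters: FPKM column sums differ between samples,
#whereas TPM column sums always equal 10^6.

```r
# Toy counts: 3 genes (rows) x 2 samples (columns); all values are hypothetical
toy_counts <- matrix(c(100, 200, 700,
                       300, 300, 400),
                     nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_length <- c(g1 = 1000, g2 = 2000, g3 = 5000)  # gene lengths in bp

# FPKM: scale by library size first, then by gene length (in kb)
toy_cpm  <- apply(toy_counts, 2, function(x) x / sum(x) * 10^6)
toy_fpkm <- toy_cpm / toy_length * 10^3

# TPM: scale by gene length first (RPK), then by the per-sample RPK total
toy_rpk <- toy_counts / toy_length * 10^3
toy_tpm <- apply(toy_rpk, 2, function(x) x / sum(x) * 10^6)

colSums(toy_fpkm)  # differs between samples
colSums(toy_tpm)   # 1e6 for every sample
```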
#PREPARING DATAFRAME FOR LATER CNV VS. mRNA-Seq CORRELATION ANALYSIS AND MFA

countsF_TPM_LOG<-log2(countsTPM[,1:10]+2)
countsF_TPM_LOG_DF<-as.data.frame(countsF_TPM_LOG)
countsF_TPM_LOG_DF$ID<-countsFInfo_backup$ID
countsF_TPM_LOG_DF$chr<-countsFInfo_backup$chr
countsF_TPM_LOG_DF$start<-countsFInfo_backup$start
countsF_TPM_LOG_DF$end<-countsFInfo_backup$end

#PCA for mRNA-Seq
countsF_TPM_LOG_DF_PCAMFA<-countsF_TPM_LOG_DF[,1:10]
 
#Transpose
countsF_TPM_LOG_DF_PCAMFA.t<-t(countsF_TPM_LOG_DF_PCAMFA)
#Assign names; we add an exp suffix to distinguish expression variables from cnv variables
colnames(countsF_TPM_LOG_DF_PCAMFA.t)<-paste(countsF_TPM_LOG_DF$ID,"exp",sep=".")
#Construct data.frame to perform PCA
expr4pca<-data.frame(cond2,countsF_TPM_LOG_DF_PCAMFA.t)
res.pca.expr<-PCA(expr4pca,quali.sup=1)

res.pca.expr
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 182 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"
plot(res.pca.expr,habillage=1)

#We observe differences between the young and old patient samples (in dim 1 and dim2)
 
 


#FPKM and TPM account for gene length and library size within each sample but do not take into account the other samples 
#of the experiment. In some situations a few genes accumulate a large share of the reads. 
#To correct for this imbalance in count composition there are methods such as the Trimmed Mean of M-values (TMM), 
#included in the edgeR package. This normalization is suitable for comparisons among samples, for instance when performing sample 
#aggregations.
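
#A minimal sketch of the composition problem described above, on hypothetical counts.
#It only illustrates the distortion that TMM is designed to correct; it does not reproduce the TMM computation of edgeR.

```r
# Two hypothetical samples with identical raw counts for genes g1-g4;
# sample s2 additionally has one very highly expressed gene (g5)
toy <- cbind(s1 = c(g1 = 100, g2 = 100, g3 = 100, g4 = 100, g5 = 100),
             s2 = c(g1 = 100, g2 = 100, g3 = 100, g4 = 100, g5 = 4600))

# Library-size scaling only (counts per million)
toy_cpm <- apply(toy, 2, function(x) x / sum(x) * 10^6)
toy_cpm

# g1-g4 look 10-fold lower in s2 although their raw counts are identical
toy_cpm["g1", "s1"] / toy_cpm["g1", "s2"]
```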

#Normalization using TMM (edgeR package)
d <- DGEList(counts = countsF)
Norm.Factor <- calcNormFactors(d, method = "TMM")
countsTMM <- cpm(Norm.Factor, log = T)

countsTMMnoLog <- cpm(Norm.Factor, log = F) 
#Observing how the distributions of the three normalizations (in log2) differ (for the first sample).

hist(log2(countsFPKM[,1]+2), xlab="log2(FPKM + 2)", main="FPKM")

#The log2-transformed FPKM values appear approximately normally distributed
hist(log2(countsTPM[,1]+2), xlab="log2(TPM + 2)", main="TPM") 

#The log2-transformed TPM values appear approximately normally distributed

#We will later need the gene ID to be added to the filtered, TPM-normalized, log-transformed mRNA-seq counts matrix 
#for the mRNA-seq vs. GISTIC CNV correlation analysis.

hist(countsTMM[,1], xlab="log2-CPM (TMM)", main="TMM") 

#The TMM-normalized log2-CPM values appear approximately normally distributed

#To see how samples aggregate, we will perform hierarchical clustering as well as PCA. 
#The purpose is to see whether samples aggregate by condition or whether there are outliers, which might have biological or technical causes.

#Hierarchical clustering
x_rna<-countsTMM

#Euclidean distance
clust.cor.ward <- hclust(dist(t(x_rna)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)

#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

clust.cor.average <- hclust(dist(t(x_rna)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)

#The average hierarchical clustering DOES NOT appear to reflect the segregation of 5 old and 5 young patients

clust.cor.complete <- hclust(dist(t(x_rna)),method="complete")
plot(clust.cor.complete, main="hierarchical clustering", hang=-1,cex=0.8)

#The complete hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

#Correlation based distance
clust.cor.ward <- hclust(as.dist(1-cor(x_rna)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)

#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

clust.cor.average<- hclust(as.dist(1-cor(x_rna)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8) 

#The average hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
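
#A minimal sketch, on hypothetical data, of why Euclidean and correlation-based distances can group samples differently:
#Euclidean distance is sensitive to overall magnitude, whereas 1 - correlation only compares the shape of the profiles.
#Sample names a1, a2, b1, b2 are illustrative only.

```r
# Genes in rows, samples in columns (same orientation as x_rna)
# a1/a2 share one profile, b1/b2 share the opposite profile;
# a2 and b2 are on a 10-fold larger scale than a1 and b1
toy <- cbind(a1 = c(1, 2, 3, 4, 5),
             a2 = c(10, 20, 30, 40, 50),
             b1 = c(5, 4, 3, 2, 1),
             b2 = c(50, 40, 30, 20, 10))

# Euclidean distance between samples: groups by magnitude
hc_euc <- hclust(dist(t(toy)), method = "average")
grp_euc <- cutree(hc_euc, k = 2)

# Correlation-based distance: groups by profile shape
hc_cor <- hclust(as.dist(1 - cor(toy)), method = "average")
grp_cor <- cutree(hc_cor, k = 2)

grp_euc  # a1 clusters with b1, a2 with b2
grp_cor  # a1 clusters with a2, b1 with b2
```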

#Data Preparation
cond2<-phenoN$age.status
countsF_backup<-as.matrix(countsF)
sum1<-sum(is.na(countsF_backup))
sum1
## [1] 0
#Density plot of raw read counts (log10)
#Note: zero counts become -Inf after log10 (see the boxplot warnings below)
countsSF_backup_log <- log10(countsF_backup)
plot(density(countsSF_backup_log),xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw filtered read counts per gene after log10 transformation", ylab="Density")
#Overlay one density curve per sample
for (s in seq_len(ncol(countsF_backup))){
  lines(density(log10(countsF_backup[,s])))
}

#Box plots of raw filtered read counts after log10 transformation
countsSF_backup_log <- log(countsF_backup,10)
boxplot(countsSF_backup_log , main="", xlab="", ylab="Raw filtered read counts per gene after log10 transformation",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(countsSF_backup_log))),labels=colnames(countsSF_backup_log),las=2,cex.axis=0.8)

#Plot Heatmap with condition age.status as labels
colnames(countsF_backup)<-phenoN$age.status 
heatmap(countsF_backup, col = topo.colors(50), margin=c(10,6))

#The heatmap suggests that old patients show relatively lower expression for a larger number of mRNA genes

# PCA
#library ggfortify needed for the autoplot to understand and plot PCA results
summary(pca.filt <- prcomp(t(x_rna), scale=T )) 
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     6.7546 5.9800 4.8424 4.4924 4.15924 3.62842 3.47849
## Proportion of Variance 0.2507 0.1965 0.1288 0.1109 0.09505 0.07234 0.06648
## Cumulative Proportion  0.2507 0.4472 0.5760 0.6869 0.78196 0.85429 0.92078
##                           PC8     PC9      PC10
## Standard deviation     3.0315 2.28660 1.072e-14
## Proportion of Variance 0.0505 0.02873 0.000e+00
## Cumulative Proportion  0.9713 1.00000 1.000e+00
autoplot(pca.filt, data=phenoN, colour="patientID", shape="age.status")

#There does not appear to be segregation by age status on PC1 and PC2.
#Note that the first 2 principal components together account for 25.07% + 19.65% = 44.72% of the variance.
#PC10 has essentially zero variance because 10 centered samples span at most 9 dimensions.
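
#A short sketch of how the "Proportion of Variance" row follows from the standard deviations reported by summary(prcomp()).
#The values are copied from the output above; variable names are illustrative.

```r
# Standard deviations of PC1-PC9 copied from the summary above (PC10 is ~0)
pc_sdev <- c(6.7546, 5.9800, 4.8424, 4.4924, 4.15924,
             3.62842, 3.47849, 3.0315, 2.28660)
pc_var  <- pc_sdev^2             # variance explained by each component
prop    <- pc_var / sum(pc_var)  # proportion of variance per component

round(prop[1:2], 4)        # PC1 and PC2
round(cumsum(prop)[2], 4)  # cumulative proportion after PC2
sum(pc_var)                # close to 182, the number of scaled variables (genes)
```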

#RNA-seq data are represented by count matrices, so linear models like those implemented in limma 
#cannot be applied directly. There are two main options:

#1.Transform the count matrices and apply limma
#2.Use specific methods that account for the count data distribution
#The voom transformation is used for the first (limma) approach; DESeq2, which models the counts with a 
#Negative Binomial distribution, is used for the second approach.
#The limma approach for RNA-seq converts read counts to log2-counts-per-million (logCPM), and the mean-variance relationship 
#is modeled either with precision weights (the voom approach) or with an empirical Bayes prior trend (the limma-trend approach).
#Voom estimates the mean-variance relationship of the log-counts and creates weights that are later used by limma.
#Below we apply the voom transformation and the limma model to identify differentially expressed genes using variable cond2.
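
#A minimal base-R sketch of the log2-CPM conversion mentioned above, on hypothetical counts.
#It uses the same offsets as voom (0.5 added to each count, 1 added to each library size) and assumes normalization factors of 1.
#voom() additionally estimates precision weights, which this sketch does not do.

```r
# Hypothetical counts: 3 genes x 2 samples
toy <- cbind(s1 = c(g1 = 0, g2 = 10, g3 = 990),
             s2 = c(g1 = 5, g2 = 20, g3 = 1975))
lib_size <- colSums(toy)

# log2-CPM with small offsets, which keeps log2 finite for zero counts
# t(toy) has samples in rows, so each row is divided by its own library size
log_cpm <- log2(t((t(toy) + 0.5) / (lib_size + 1)) * 10^6)
log_cpm
```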

 
cond2<-phenoN$age.status
design <- model.matrix(~0+cond2)
rownames(design) <- phenoN$sample
colnames(design) <- gsub("cond2", "", colnames(design))

voom.res <- voom(countsF, design, plot = T) 

#Model fit
fit <- lmFit(voom.res, design) 

#contrasts
contrast.matrix <- makeContrasts(con1=old-young,levels = design) 

#contrasts fit and empirical Bayes moderation
fit2 <- contrasts.fit(fit, contrast.matrix)
fite <- eBayes(fit2)
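
#A minimal sketch, on hypothetical values for a single gene, of what the no-intercept design and the old-young contrast estimate.
#It uses ordinary least squares only; the variance moderation done by eBayes() is not reproduced here.

```r
# Hypothetical expression values for one gene in 3 old and 3 young patients
cond <- factor(c("old", "old", "old", "young", "young", "young"))
y    <- c(8, 9, 10, 5, 6, 7)

# Cell-means design: one column per group, no intercept
design_toy <- model.matrix(~ 0 + cond)
colnames(design_toy) <- gsub("cond", "", colnames(design_toy))

# Ordinary least squares: the coefficients are the group means
coefs <- lm.fit(design_toy, y)$coefficients

# The contrast old - young is the difference of those means
contrast_old_young <- unname(coefs["old"] - coefs["young"])
contrast_old_young
```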

#summary 
summary(decideTests(fite, method = "separate"))
##        con1
## Down      0
## NotSig  182
## Up        0
#Without adjusting for multiple comparisons (not advisable; shown for exploration only)
summary(decideTests(fite, adjust.method = "none", method = "separate")) 
##        con1
## Down      4
## NotSig  173
## Up        5
#global model
top.table <- topTable(fite, number = Inf, adjust = "fdr")
#Now study how p-values behave. Under the null hypothesis, p-values are expected to have a uniform distribution.

hist(top.table$P.Value, breaks = 100, main = "results P")

#No significant results were obtained at FDR < 0.05, and the non-uniform distribution of p-values 
#suggests that some source of variability was not considered in the model. 
#We do not include other colData variables of the MultiAssayExperiment in the model to check whether results improve.
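
#A small simulation sketch of the statement that p-values are uniform under the null hypothesis.
#The data are simulated noise (hypothetical, 5 vs 5 samples) and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# 2000 hypothetical genes, 5 old vs 5 young samples, no true group difference
n_genes <- 2000
group   <- rep(c("old", "young"), each = 5)

# p-value of a two-sample t-test on pure noise
one_p <- function() {
  x <- rnorm(10)
  t.test(x[group == "old"], x[group == "young"])$p.value
}
null_p <- replicate(n_genes, one_p())

# Under the null the histogram is roughly flat and only a small
# fraction of p-values falls below 0.05 by chance
hist(null_p, breaks = 20, main = "p-values under the null")
mean(null_p < 0.05)
```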

#DESeq2 on SUMMARIZED EXPERIMENT:
#As input, the DESeq2 package expects raw count data in the form of a matrix of integer values. 
#The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts 
#scaled by library size should not be used as input. The estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions.
#ddsSE <- DESeqDataSet(mACC.exp3, design = ~ colnames(mACC.exp3))
#ddsSE
#filtering
#keep <- rowSums(counts(ddsSE) >= 10) >= 5
#ddsSE <- ddsSE[keep,]

sum_na<-sum(is.na(countsF))

#DESeq2 on COUNT MATRIX:
#Filtering is also advised by DESeq2, so we will create the DESeqDataSet from the filtered counts matrix.
countsF_int<-countsF
object.size(countsF_int)
## 27640 bytes
mode(countsF_int) <- "integer"
object.size(countsF_int)
## 20360 bytes
dds <- DESeqDataSetFromMatrix(countData = countsF_int,colData = phenoN,design = ~ age.status) 
#To benefit from the default settings of the package, you should put the variable of interest at 
#the end of the formula and make sure the control level is the first level. This is not necessary if contrast option is used as here
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Global model
resG <- results(dds, alpha=0.05) #lfcThreshold is by default 0
summary(resG)
## 
## out of 182 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 0, 0%
## LFC < 0 (down)     : 0, 0%
## outliers [1]       : 2, 1.1%
## low counts [2]     : 0, 0%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Contrast old vs. young (results() uses its default alpha of 0.1 here, unlike the global model above)
res1 <- results(dds, contrast=c("age.status","old","young"))
summary(res1)
## 
## out of 182 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 3, 1.6%
## LFC < 0 (down)     : 2, 1.1%
## outliers [1]       : 2, 1.1%
## low counts [2]     : 0, 0%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
res1DF <- as.data.frame(res1)
res1DFS <- res1DF[order(res1DF$pvalue),]
res1DFSign <- res1DFS[!is.na(res1DFS$pvalue) & res1DFS$pvalue<0.05, ]
res1DFSign
##           baseMean log2FoldChange     lfcSE      stat       pvalue       padj
## ITGA2    1327.6951      2.7360301 0.7603356  3.598451 0.0003201186 0.05153136
## TGM2     2026.5741      1.7482773 0.5176150  3.377563 0.0007313117 0.05153136
## CDKN2A    489.3470     -2.1809438 0.6543299 -3.333095 0.0008588560 0.05153136
## NRAS      642.2827     -1.4122025 0.4652978 -3.035051 0.0024049521 0.09536657
## ASNS      541.1699      1.3216405 0.4397008  3.005772 0.0026490714 0.09536657
## EGFR      255.6320      2.2555207 0.8678233  2.599055 0.0093480731 0.24120018
## XBP1     2126.5317      1.2390244 0.4769359  2.597884 0.0093800071 0.24120018
## SYK       283.9256      2.8631917 1.1340265  2.524801 0.0115763723 0.25844433
## ADAR     7898.8241      1.7509197 0.7043387  2.485906 0.0129222165 0.25844433
## TSC2     1454.1945      0.4284242 0.1777503  2.410259 0.0159412085 0.27142621
## MAPK9    1224.1549      0.8624657 0.3600007  2.395733 0.0165871575 0.27142621
## SHC1     4507.9093      1.4601221 0.6344295  2.301473 0.0213649360 0.31025225
## RAD50    1365.3078      1.0988116 0.4854012  2.263718 0.0235914522 0.31025225
## FASN     6331.0574     -1.8195275 0.8121507 -2.240382 0.0250661595 0.31025225
## SERPINE1 1642.4591      1.8487470 0.8296326  2.228392 0.0258543541 0.31025225
## PIK3R1   1677.0962      1.9490034 0.9172863  2.124749 0.0336075470 0.37808490
## AKT1S1   2463.3840     -0.9591121 0.4677636 -2.050421 0.0403234056 0.42695371
## YBX1     5622.6716     -0.9009675 0.4514227 -1.995840 0.0459513352 0.43659251
#From the DESeq2 model, at the default threshold of adjusted p-value < 0.1, 3 genes are overexpressed in old vs. young patients (ITGA2, TGM2, ASNS)
#and 2 genes are underexpressed (CDKN2A, NRAS); none of them reaches adjusted p-value < 0.05.
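
#A short sketch contrasting nominal and adjusted thresholds, using the first seven rows of the DESeq2 table above.
#The data frame name top is illustrative.

```r
# pvalue and padj copied from the first seven rows of the table above
top <- data.frame(
  pvalue = c(0.0003201186, 0.0007313117, 0.0008588560, 0.0024049521,
             0.0026490714, 0.0093480731, 0.0093800071),
  padj   = c(0.05153136, 0.05153136, 0.05153136, 0.09536657,
             0.09536657, 0.24120018, 0.24120018),
  row.names = c("ITGA2", "TGM2", "CDKN2A", "NRAS", "ASNS", "EGFR", "XBP1")
)

rownames(top)[top$pvalue < 0.05]  # nominal significance: all seven genes
rownames(top)[top$padj < 0.1]     # adjusted p < 0.1: ITGA2, TGM2, CDKN2A, NRAS, ASNS
rownames(top)[top$padj < 0.05]    # adjusted p < 0.05: none
```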

#Results of limma and DESeq2 can be visualized using volcano plots and heatmaps. 
#We will just create plots for the first contrast.

#Volcano plot

colorS <- c(DN = "blue", n.s = "grey", UP = "red") #named so that each colour is tied to its sig class
#Note: significance below is based on raw p-values (P.Value), not adjusted p-values

#specific parameters
showGenes <- 20 #genes to be displayed with names

dataV <- topTable(fite, n = Inf, coef = "con1", adjust = "fdr")
dataV <- dataV %>% mutate(gene = rownames(dataV), logp = -(log10(P.Value)), logadjp = -(log10(adj.P.Val)),
                          FC = ifelse(logFC>0, 2^logFC, -(2^abs(logFC)))) %>%
  mutate(sig = ifelse(P.Value<0.01 & logFC > 1, "UP", ifelse(P.Value<0.01 & logFC < (-1), "DN","n.s"))) #ideally we should have an adj.P.Val < 0.05

p <- ggplot(data=dataV, aes(x=logFC, y=logp )) +
  geom_point(alpha = 1, size= 1, aes(col = sig)) + 
  scale_color_manual(values = colorS) +
  xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"(p.val)")) + labs(col=" ") + 
  geom_vline(xintercept = 1, linetype= "dotted") + geom_vline(xintercept = -1, linetype= "dotted") + 
  geom_hline(yintercept = -log10(0.01), linetype= "dotted")  +  theme_bw() #matches the P.Value < 0.01 cutoff used for sig

p <- p + geom_text_repel(data = head(dataV[dataV$sig != "n.s",],showGenes), aes(label = gene)) 

print(p)

#At the nominal thresholds used here (P.Value < 0.01 and |log2FC| > 1, not FDR-adjusted), CDKN2A appears downregulated and TGM2 upregulated
#as a function of age (old vs. young)

#Heatmap
#Plotting heatmap results for the limma model (without adjusting for variable patientID).

t1 <- topTable(fite, n = Inf, coef = "con1", adjust = "fdr")
res1 <- t1[t1$P.Value<0.01 & abs(t1$logFC) > 1,]

data.clus <- countsTMM[rownames(res1),]

cond2.df <- as.data.frame(cond2)
rownames(cond2.df) <- colnames(data.clus)
pheatmap(data.clus, scale = "row", show_rownames = TRUE, annotation_col = cond2.df)

#In the heatmap, TGM2 is overexpressed in old patients and underexpressed in young patients.
#CDKN2A is overexpressed in young patients. CDKN2A is aberrantly downregulated in the old patient A5LC.

#GENE ANNOTATION AND GENE ONTOLOGY:

#Load the library
#The central ID for org.Hs.eg.db, a genome-wide annotation for humans based on Entrez Gene, is the NCBI Gene ID.
#org.Hs.egACCNUM is an R object that contains mappings between Entrez Gene identifiers and
#GenBank accession numbers.

#Define list of 5 genes of interest (DE genes - EntrezGene IDs)
gene_entrez1<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[1],16]#OVER
gene_entrez2<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[2],16]#OVER
gene_entrez3<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[3],16]#UNDER
gene_entrez4<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[4],16]#UNDER
gene_entrez5<-countsFInfo[countsFInfo$ID == rownames(res1DFSign)[5],16]#OVER
gene_entrez_total_OVER<-as.character(c(gene_entrez1,gene_entrez2,gene_entrez5))
gene_entrez_total_OVER 
## [1] "3673" "7052" "440"
gene_entrez_total_UNDER<-as.character(c(gene_entrez3,gene_entrez4))
gene_entrez_total_UNDER
## [1] "1029" "4893"
# Define the universe as all the BioMart-obtained ENTREZ GENE IDs corresponding to our non-duplicated mRNA genes
universeids <- as.character(countsFInfo[,16])
length(universeids)
## [1] 181
#Before running the hypergeometric test with the hyperGTest function, we need to define the parameters
#for the test (gene lists, ontology, test direction) as well as the annotation database to be used. 
#The ontology to be tested can be any of the three GO domains: biological process (“BP”), cellular component (“CC”) or molecular function (“MF”).
#We will test for over-represented biological processes in our list of differentially expressed genes.

# define the p-value cut off for the hypergeometric test
hgCutoff <- 0.05

#Conducting test for overexpressed genes
params_over <- new("GOHyperGParams",annotation="org.Hs.eg",geneIds=gene_entrez_total_OVER ,universeGeneIds=universeids,ontology="BP",pvalueCutoff=hgCutoff,testDirection="over")

#Run the test
hg_over <- hyperGTest(params_over)
# Check results
hg_over
## Gene to GO BP  test for over-representation 
## 425 GO BP ids tested (89 have p < 0.05)
## Selected gene set size: 3 
##     Gene universe size: 181 
##     Annotation package: org.Hs.eg
#Build the output table of significant GO terms. FDR-adjusted p-values are computed for reference, but the selection below uses the unadjusted p-values.
#Get the p-values of the test
hg.pv_over <- pvalues(hg_over)
## Adjust p-values for multiple test (FDR)
hg.pv.fdr_over <- p.adjust(hg.pv_over,'fdr')
## select the GO terms with adjusted p-value less than the cut off (not used here)
#sigGO.ID_over <- names(hg.pv.fdr_over[hg.pv.fdr_over < hgCutoff])
#select the GO terms with NON-adjusted p-value less than the cut off
sigGO.ID_over <- names(hg.pv_over[pvalues(hg_over) < hgCutoff])
length(sigGO.ID_over)
## [1] 89
#get table from HyperG test result
df_over <- summary(hg_over)
# keep only significant GO terms in the table
GOannot.table_over <- df_over[df_over[,1] %in% sigGO.ID_over,]
head(GOannot.table_over)
##       GOBPID       Pvalue OddsRatio   ExpCount Count Size
## 1 GO:0050764 0.0005504285 354.00000 0.04972376     2    3
## 2 GO:0030100 0.0064569894  48.85714 0.14917127     2    9
## 3 GO:0006909 0.0137761454  30.36364 0.21546961     2   13
## 4 GO:0006528 0.0165745856       Inf 0.01657459     1    1
## 5 GO:0006529 0.0165745856       Inf 0.01657459     1    1
## 6 GO:0006541 0.0165745856       Inf 0.01657459     1    1
##                              Term
## 1      regulation of phagocytosis
## 2       regulation of endocytosis
## 3                    phagocytosis
## 4    asparagine metabolic process
## 5 asparagine biosynthetic process
## 6     glutamine metabolic process
#With only 3 genes in the selected set and unadjusted p-values, these results are tentative: the overexpressed protein-coding genes are 
#associated with phagocytosis, endocytosis and asparagine/glutamine metabolic processes
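
#A short sketch of the hypergeometric calculation behind the first row of the table above (GO:0050764),
#using base R phyper() and the sizes reported in the output. Variable names are illustrative.

```r
universe_size <- 181  # genes in the universe (from the output above)
selected_size <- 3    # overexpressed genes tested
term_size     <- 3    # universe genes annotated to GO:0050764 (column Size)
term_count    <- 2    # selected genes annotated to the term (column Count)

# P(X >= term_count) when drawing selected_size genes without replacement
p_over <- phyper(term_count - 1, m = term_size,
                 n = universe_size - term_size,
                 k = selected_size, lower.tail = FALSE)
p_over

# Expected number of annotated genes in the selection by chance (column ExpCount)
exp_count <- selected_size * term_size / universe_size
exp_count
```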

#VISUALIZATION OF mRNA-Seq Gene Expression 

#SUBSET LIST OF ANNOTATED mRNA GENES THAT ARE SIGNIFICANTLY DGE BETWEEN OLD AND YOUNG PATIENTS WITH CORRESPONDING GENE POSITION COORDINATES AND CHROMOSOMES:

countsFInfo_sig<-countsFInfo[countsFInfo$ID %in% rownames(res1DFSign),]
countsFInfo_sig<-countsFInfo_sig[,c("ID", "chromosome_name", "start_position", "end_position")]
countsFInfo_sig
##           ID chromosome_name start_position end_position
## 8     AKT1S1              19       49869033     49878459
## 17      ASNS               7       97851677     97872542
## 26      NRAS               1      114704469    114716771
## 51     MAPK9               5      180233143    180292099
## 55     ITGA2               5       52989340     53094779
## 59      ADAR               1      154582057    154628013
## 61      EGFR               7       55019017     55211628
## 63      FASN              17       82078338     82098294
## 66  SERPINE1               7      101127104    101139247
## 77      TSC2              16        2047967      2089491
## 80      YBX1               1       42682418     42703805
## 89      SHC1               1      154962298    154974395
## 106     TGM2              20       38127385     38166578
## 122    RAD50               5      132556019    132646349
## 128   PIK3R1               5       68215756     68301821
## 131     XBP1              22       28794555     28800597
## 182      SYK               9       90801787     90898549
## 192   CDKN2A               9       21967752     21995301
#Note on the table above: the SHC1 end position is flagged as uncertain, 154974395 (154974376?)

#As further confirmed via the NCBI website, the DGE genes (nominal p-value < 0.05) have the chromosomal positions listed above.
#Chromosomes 1, 5, 7 and 9 each carry more than one DGE gene

#GVIZ VISUALIZATION OF mRNA-Seq Gene Expression for CDKN2A gene on chromosome 9:
#Gviz displays information for a genomic region of a specific chromosome. It works with tracks, which need to be defined. 
#The virtual parent class for all track items in the package is the GdObject class. This class definition contains all the common 
#entities that are needed for a track to be plotted.
#There are constructor functions for each track as well as a broad range of methods to interact with and to plot them. 
#Once the tracks are defined, we can use function plotTracks() to plot them. We will introduce the basic tracks.
mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
rowRanges(mRNA_expr)
## GRanges object with 195 ranges and 1 metadata column:
##          seqnames              ranges strand |     gene_id
##             <Rle>           <IRanges>  <Rle> | <character>
##   DIRAS3        1   68511645-68516481      - |        9077
##   MAPK14        6   35995454-36079013      + |        1432
##     YAP1       11 101981192-102104154      + |       10413
##   CDKN1B       12   12870302-12875305      + |        1027
##    ERBB2       17   37844393-37884915      + |        2064
##      ...      ...                 ...    ... .         ...
##    MACC1        7   20174279-20257013      - |      346389
##     CHGA       14   93389445-93401638      + |        1113
##    IDH3A       15   78441719-78462884      + |        3419
##   SQSTM1        5 179233388-179265077      + |        8878
##   KCNJ13        2 233630512-233641275      - |        3769
##   -------
##   seqinfo: 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19)
#Already a GRanges Object
#mRNA_expr.gr<-unlist(rowRanges(mRNA_expr))#from a GRangesList to a GRanges object?  
mRNA_expr.gr<-rowRanges(mRNA_expr)
table(seqnames(mRNA_expr.gr))
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   22   11   13    7    9    5    9    8    9    8   10   11    2    2    7    7 
##   17   18   19   20   21   22    X    Y chrM 
##   13    4   16    8    2    6    6    0    0
#Although no Y-chromosome genes are present, the gender of the patients is recorded as follows:
colData(miniACC.assays.comp.age)$gender
##  [1] "female" "female" "female" "male"   "female" "female" "female" "female"
##  [9] "male"   "female"
mRNA_expr.9<-mRNA_expr.gr[seqnames(mRNA_expr.gr)=='9',]
mRNA_expr.9<-keepSeqlevels(mRNA_expr.9,"9") #to remove undesired levels
exprs.9<-assays(mRNA_expr)$exprs[names(mRNA_expr.9),]
head(exprs.9)
##        TCGA-OR-A5J9-01A-11R-A29S-07 TCGA-OR-A5JE-01A-11R-A29S-07
## NOTCH1                     558.5875                     188.3484
## TSC1                       948.5448                     913.4615
## LCN2                         0.0000                       1.6968
## RPS6                      6238.2972                   31138.0090
## TTF1                       316.1816                     243.2127
## PTCH1                      371.2584                     169.6833
##        TCGA-OR-A5JF-01A-11R-A29S-07 TCGA-OR-A5JI-01A-11R-A29S-07
## NOTCH1                     381.5040                     566.1425
## TSC1                      1076.1657                     833.4875
## LCN2                         6.1782                       6.4755
## RPS6                     22022.5891                   27255.3191
## TTF1                       291.1478                     214.6161
## PTCH1                      311.2270                     128.5846
##        TCGA-OR-A5K0-01A-11R-A29S-07 TCGA-OR-A5KV-01A-11R-A29S-07
## NOTCH1                     239.9577                     180.1003
## TSC1                       700.8457                    1015.4261
## LCN2                        33.2981                       0.0000
## RPS6                     12192.3890                   37627.8442
## TTF1                       320.2960                     251.0605
## PTCH1                      162.7907                      52.4489
##        TCGA-OR-A5L5-01A-11R-A29S-07 TCGA-OR-A5LC-01A-11R-A29S-07
## NOTCH1                     507.9192                     343.9620
## TSC1                       807.7553                     601.7639
## LCN2                      1868.9241                     378.5617
## RPS6                     29292.7362                   21307.3270
## TTF1                       197.1600                     221.1669
## PTCH1                      209.7215                     334.4640
##        TCGA-OR-A5LE-01A-11R-A29S-07 TCGA-OR-A5LL-01A-11R-A29S-07
## NOTCH1                      93.4855                     136.9822
## TSC1                       868.4314                    2008.6735
## LCN2                         2.4601                      82.5482
## RPS6                     15172.9482                   14326.3048
## TTF1                       296.4475                     281.1425
## PTCH1                      101.4810                     145.9548
chr2 <- "chr9"
geno2 <- "hg19"
atrack2 <- AnnotationTrack(mRNA_expr.9, name = "mRNA-Seq for Gene CDKN2A")
gtrack2 <- GenomeAxisTrack() 
itrack2 <- IdeogramTrack(gen = geno2, chromosome = chr2) 

#We set from and to in plotTracks() to delimit the region (20-25 Mb on chromosome 9, which contains CDKN2A)
dtrack2 <- DataTrack(data = t(exprs.9), start=start(mRNA_expr.9), end=end(mRNA_expr.9),chromosome = chr2, genome = geno2,name = "mRNA-Seq for Gene CDKN2A")
plotTracks(list(gtrack2, atrack2, itrack2,dtrack2),from=20000000,to=25000000,type="heatmap", col="blue") #heatmap

#data(geneModels) #data.frame containing 97 genes at chromosome 7 
#head(geneModels)
#str(geneModels)
#grtrack <- GeneRegionTrack(geneModels, genome = genome(mRNA_expr),chromosome = as.character(unique(seqnames(mRNA_expr))),name = "Gene Model", transcriptAnnotation = "symbol", background.title = "brown")
#head(displayPars(grtrack))
#itrack <- IdeogramTrack(genome = "hg19", chromosome = "chr7") 
#We choose to set a from and a to in the plotTracks to delimitate the region
#dtrack <- DataTrack(data = t(exprs.7), start=start(mRNA_expr.7), end=end(mRNA_expr.7),chromosome = as.character(unique(seqnames(mRNA_expr))), genome = genome(mRNA_expr),name = "mRNA-Seq for Chromosome 7")
#The sequence track adds the genomic sequences of nucleotides, when needed.
#strack <- SequenceTrack(Hsapiens, chromosome = as.character(unique(seqnames(mRNA_expr))))
#delimit the region
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack,strack), from = 26591822, to = 26591852, cex = 0.8)
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852)
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "histogram")
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "l")
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "heatmap", legend=T) 
#plotTracks(list(itrack,gtrack, atrack, grtrack,dtrack), from = 26591822, to = 26591852,type = "boxplot") 

#CIRCOS VISUALIZATION:

options(stringsAsFactors = FALSE) #important argument to keep control of factors, otherwise colors are lost in OmicCircos
seqinfo(mRNA_expr)
## Seqinfo object with 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19):
##   seqnames seqlengths isCircular     genome
##   1         249250621       <NA> GRCh37.p13
##   2         243199373       <NA> GRCh37.p13
##   3         198022430       <NA> GRCh37.p13
##   4         191154276       <NA> GRCh37.p13
##   5         180915260       <NA> GRCh37.p13
##   ...             ...        ...        ...
##   21         48129895       <NA> GRCh37.p13
##   22         51304566       <NA> GRCh37.p13
##   X         155270560       <NA> GRCh37.p13
##   Y          59373566       <NA> GRCh37.p13
##   chrM          16571       TRUE       hg19
range(assays(mRNA_expr)$"exprs")
## [1]      0.0 206162.3
rr.df<-as.data.frame(rowRanges(mRNA_expr))
rna<-assays(mRNA_expr)$"exprs"
 

#filtering
SD <-apply(rna,1,sd)
cbind(quantiles <-quantile(SD, probs = seq(0, 1, 0.01)))
##              [,1]
## 0%   3.342844e-01
## 1%   4.607587e-01
## 2%   6.208587e+00
## 3%   1.064034e+01
## 4%   1.513817e+01
## 5%   2.092353e+01
## 6%   2.366263e+01
## 7%   2.719731e+01
## 8%   2.995949e+01
## 9%   3.743964e+01
## 10%  4.224227e+01
## 11%  4.661433e+01
## 12%  7.394512e+01
## 13%  7.905309e+01
## 14%  9.587964e+01
## 15%  1.005815e+02
## 16%  1.170762e+02
## 17%  1.321343e+02
## 18%  1.326502e+02
## 19%  1.365523e+02
## 20%  1.730714e+02
## 21%  1.772702e+02
## 22%  1.886710e+02
## 23%  2.053126e+02
## 24%  2.211072e+02
## 25%  2.349678e+02
## 26%  2.488232e+02
## 27%  2.524245e+02
## 28%  2.606581e+02
## 29%  2.770464e+02
## 30%  3.122693e+02
## 31%  3.533687e+02
## 32%  3.639844e+02
## 33%  3.678300e+02
## 34%  3.780775e+02
## 35%  3.791259e+02
## 36%  3.820264e+02
## 37%  3.865365e+02
## 38%  3.882507e+02
## 39%  3.905082e+02
## 40%  3.941259e+02
## 41%  3.994727e+02
## 42%  4.136960e+02
## 43%  4.276093e+02
## 44%  4.440190e+02
## 45%  4.644401e+02
## 46%  4.853544e+02
## 47%  4.902076e+02
## 48%  5.038344e+02
## 49%  5.069006e+02
## 50%  5.094696e+02
## 51%  5.399463e+02
## 52%  5.487840e+02
## 53%  5.562187e+02
## 54%  5.835128e+02
## 55%  6.074455e+02
## 56%  6.229177e+02
## 57%  6.449953e+02
## 58%  6.613223e+02
## 59%  6.983339e+02
## 60%  7.195033e+02
## 61%  7.358404e+02
## 62%  7.620481e+02
## 63%  7.805220e+02
## 64%  8.047014e+02
## 65%  8.131532e+02
## 66%  8.355629e+02
## 67%  8.781700e+02
## 68%  8.843951e+02
## 69%  9.000901e+02
## 70%  9.621060e+02
## 71%  1.106660e+03
## 72%  1.169666e+03
## 73%  1.255653e+03
## 74%  1.296733e+03
## 75%  1.329923e+03
## 76%  1.407456e+03
## 77%  1.496389e+03
## 78%  1.577808e+03
## 79%  1.602427e+03
## 80%  1.655300e+03
## 81%  1.748514e+03
## 82%  1.872036e+03
## 83%  1.965576e+03
## 84%  2.149515e+03
## 85%  2.306744e+03
## 86%  2.513429e+03
## 87%  2.607063e+03
## 88%  2.911394e+03
## 89%  3.109756e+03
## 90%  3.722222e+03
## 91%  4.363940e+03
## 92%  5.108326e+03
## 93%  5.382823e+03
## 94%  5.692236e+03
## 95%  5.832842e+03
## 96%  6.880619e+03
## 97%  7.146997e+03
## 98%  9.846067e+03
## 99%  1.611046e+04
## 100% 5.167708e+04
rna.f<-rna[SD>quantiles["98%"],]
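
#A minimal sketch of the variance filter above, run on simulated data rather than the ACC matrix.
#The names toy, toy_sd, toy_q and toy_top are hypothetical and only illustrate how genes above the 98th percentile of the per-gene SD are retained.

```r
# Toy data: 100 hypothetical genes x 6 samples (simulated, not ACC data)
set.seed(1)
toy <- matrix(rexp(600, rate = 1 / 100), nrow = 100,
              dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

# Per-gene standard deviation across samples
toy_sd <- apply(toy, 1, sd)

# Percentiles of the SD distribution, named "0%", "1%", ..., "100%"
toy_q <- quantile(toy_sd, probs = seq(0, 1, 0.01))

# Keep only the genes whose SD exceeds the 98th percentile
toy_top <- toy[toy_sd > toy_q["98%"], , drop = FALSE]
nrow(toy_top)
```

#With drop = FALSE the result stays a matrix even if a single gene passes the filter.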

rr.df.f<-rr.df[rownames(rna.f),]
T.rr<-data.frame("chr"=rr.df.f$seqnames,"Start"=as.integer(rr.df.f$start),"End"=as.integer(rr.df.f$end),rna.f,row.names=NULL)
par(mar=c(2, 2, 2, 2));
plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=380, cir="hg19", W=4,   type="chr", print.chr.lab=T, scale=T);
circos(R=320, cir="hg19", W=50,  mapping=T.rr,   col.v=4,    type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");

#Check the scale and consider transforming it
range(rna.f) 
## [1]   2476.411 206162.330
#Perform a log2 transformation with an offset of 1 (since log(0) = -Inf)
T.rr<-data.frame("chr"=rr.df.f$seqnames,"Start"=as.integer(rr.df.f$start),"End"=as.integer(rr.df.f$end),log2(rna.f+1),row.names=NULL)
par(mar=c(2, 2, 2, 2));
plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=400, cir="hg19", W=4,   type="chr", print.chr.lab=T, scale=T);
circos(R=340, cir="hg19", W=50,  mapping=T.rr,   col.v=4,    type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");

#GGBIO VISUALIZATION OF CHROMOSOME 1 GENES NRAS, ADAR, SHC1, AND YBX1 mRNA-Seq GENE EXPRESSION:

#Ideogram
p.ideo <- Ideogram(genome = "hg19")
## use chr1 automatically
p.ideo

data(genesymbol, package = "biovizBase")
genesymbol #GRanges object
## GRanges object with 29177 ranges and 2 metadata columns:
##                seqnames              ranges strand |       symbol
##                   <Rle>           <IRanges>  <Rle> |  <character>
##           A1BG    chr19   58858174-58864865      - |         A1BG
##            A2M    chr12     9220304-9268558      - |          A2M
##           NAT1     chr8   18027971-18081197      + |         NAT1
##           NAT1     chr8   18067618-18081197      + |         NAT1
##           NAT1     chr8   18079177-18081197      + |         NAT1
##            ...      ...                 ...    ... .          ...
##   LOC100499405    chr12     9392599-9395645      + | LOC100499405
##   LOC100499467    chr17   70399463-70588943      - | LOC100499467
##       C9orf174     chr9 100069910-100139575      + |     C9orf174
##   LOC100499484     chr9 100000708-100059594      + | LOC100499484
##   LOC100499489    chr10   22724354-22726858      - | LOC100499489
##                     ensembl_id
##                    <character>
##           A1BG ENSG00000121410
##            A2M ENSG00000175899
##           NAT1 ENSG00000171428
##           NAT1 ENSG00000171428
##           NAT1 ENSG00000171428
##            ...             ...
##   LOC100499405            <NA>
##   LOC100499467            <NA>
##       C9orf174 ENSG00000197816
##   LOC100499484            <NA>
##   LOC100499489            <NA>
##   -------
##   seqinfo: 45 sequences from an unspecified genome; no seqlengths
# select just some symbols
wh <- genesymbol[c("NRAS","ADAR","SHC1")]
# define the range
wh <- range(wh, ignore.strand = TRUE)

# gene model track from OrganismDb object, could also be created from 
# TxDb object GRangesList object or EnsDb object
p.genes <- autoplot(Homo.sapiens, which = wh)
## Parsing transcripts...
## Parsing exons...
## Parsing cds...
## Parsing utrs...
## ------exons...
## ------cdss...
## ------introns...
## ------utr...
## aggregating...
## Done
## 'select()' returned 1:1 mapping between keys and columns
## Constructing graphics...
p.genes
## Warning: Removed 234 rows containing missing values or values outside the scale range
## (`geom_text()`).

#plot bam files, containing alignments, extracted from the biovizBase package
#bamfile <- system.file("extdata", "SRX21981997subADAR.bam", package="biovizBase")
#wh <- keepSeqlevels(wh, "chr1")
#bg <- BSgenome.Hsapiens.UCSC.hg19
#p.mis <- autoplot(bamfile, bsgenome = bg, which = wh, stat = "mismatch") #mismatches in the alignments, by nucleotide
#p.mis
#tracks() to bind previously generated plots
#gr1 <- GRanges("chr1", IRanges(114704469, 154974376))
#tks <- tracks(p.ideo, gene = p.genes, mismatch = p.mis, heights = c(2, 10,3)) + xlim(gr1) 
#tks
#Another theme to plot
#tks + theme_tracks_sunset()

miRNA-Seq DATA BLOCK ANALYSIS

#Preliminary analysis of individual extracted miRNA-seq Summarized Experiment:

#microRNAs (miRNAs) are short (20-24 nt) non-coding RNAs that are involved in post-transcriptional regulation of gene expression 
#in multicellular organisms by affecting both the stability and translation of mRNAs. miRNAs are transcribed by RNA polymerase II 
#as part of capped and polyadenylated primary transcripts (pri-miRNAs) that can be either protein-coding or non-coding. 
#The primary transcript is cleaved by the Drosha ribonuclease III enzyme to produce an approximately 70-nt stem-loop precursor miRNA (pre-miRNA),
#which is further cleaved by the cytoplasmic Dicer ribonuclease to generate the mature miRNA and antisense miRNA star (miRNA*) products. 
#The mature miRNA is incorporated into an RNA-induced silencing complex (RISC), which recognizes target mRNAs through imperfect base pairing 
#with the miRNA and most commonly results in translational inhibition or destabilization of the target mRNA. 
#The RefSeq represents the predicted microRNA stem-loop.

#Creating a phenotype dataframe for miRNA expression:
phenoN_micro <- data.frame(sample=colnames(mACC.mir.c3),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN_micro)<-phenoN_micro$sample 
countsM_micro <- as.matrix(assays(mACC.mir3)$exprs)
 
#The gene IDs are miRBase stem-loop IDs (e.g. hsa-let-7a-1). The corresponding HGNC official symbol is MIRLET7A1
#and the official full name provided by HGNC is microRNA let-7a-1, for example.
#This would suggest that over 50% of genes are under microRNA regulation.
#https://www.ncbi.nlm.nih.gov/gene/406881
#https://www.ensembl.org/biomart/martview/bcd31ecb53c27f25ed8176ab4dfef813

sum(is.na(countsM_micro))
## [1] 0
#As part of the exploration, we plot data
boxplot(countsM_micro) #The provided values have not been log2-transformed

#The fifth and last samples appear to have outliers
boxplot(log2(countsM_micro+2))

#Check Library size
lSize_micro <- colSums(countsM_micro)
lSize_micro #all sample sums exceed 1M (not equal to 1M, as would be expected for per-million scaled values such as CPM or TPM) and are non-homogeneous 
## TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13 
##                      4541066                      5125120 
## TCGA-OR-A5JF-01A-11R-A29W-13 TCGA-OR-A5JI-01A-11R-A29W-13 
##                      5098006                      6600740 
## TCGA-OR-A5K0-01A-11R-A29W-13 TCGA-OR-A5KV-01A-11R-A29W-13 
##                      6624927                      2408786 
## TCGA-OR-A5L5-01A-11R-A29W-13 TCGA-OR-A5LC-01A-11R-A29W-13 
##                      3018597                      4371030 
## TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13 
##                      9484599                      6751885
#We study the total number of reads per sample (library size), in millions.
sampleT_micro <- apply(countsM_micro, 2, sum)/10^6
range(sampleT_micro)
## [1] 2.408786 9.484599
sampleTDF_micro <- data.frame(sample=names(sampleT_micro), total=sampleT_micro)

p <- ggplot(aes(x=sample, y=total, fill=total), data=sampleTDF_micro) + geom_bar(stat="identity") #map aesthetics to the data frame column rather than the global vector
p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("")

#Evidently, samples 6 and 7 (A5KV and A5L5) have relatively few reads and sample 9 (A5LE) has the most reads
 
#Our "old" group and "young" group have 5 samples each (10 patients, 10 samples total)
keep_micro <- rowSums(countsM_micro > 10) >= 5 # keep miRNAs with more than 10 reads in at least 5 samples
countsF_micro <- countsM_micro[keep_micro,]
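
#A minimal sketch of the low-count filter above on a hypothetical 3 x 6 count matrix (toy_counts and toy_keep are illustrative names).
#The threshold of 4 samples is an assumption chosen for the toy data; the analysis itself uses 5, the size of each age group.

```r
# Hypothetical counts: 3 miRNAs x 6 samples
toy_counts <- matrix(c( 0,  5, 12,  30,  11, 50,
                       20, 25, 12,  30,  11, 50,
                        0,  0,  0, 100, 200,  3),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("mirA", "mirB", "mirC"),
                                     paste0("s", 1:6)))

# Logical matrix (count > 10), summed per row = number of samples passing
toy_keep <- rowSums(toy_counts > 10) >= 4

# Retain only the rows that pass
toy_filtered <- toy_counts[toy_keep, , drop = FALSE]
rownames(toy_filtered)
```

#Setting the minimum number of samples to the size of the smallest group keeps features that are expressed in only one condition.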
 
#ensembl_reg_102<-useEnsembl(biomart = 'regulation', dataset = 'hsapiens_gene_ensembl',version = 102)
#getAnnotation(mart,featureType = c("TSS", "miRNA", "Exon", "5utr", "3utr", "ExonPlusUtr", "transcript"))
searchFilters(mart = ensembl102, pattern = "miRBase")
##                       name
## 32            with_mirbase
## 33 with_mirbase_trans_name
## 91       mirbase_accession
## 92              mirbase_id
## 93      mirbase_trans_name
##                                                description
## 32                                      With miRBase ID(s)
## 33                      With miRBase transcript name ID(s)
## 91                   miRBase accession(s) [e.g. MI0000060]
## 92                       miRBase ID(s) [e.g. hsa-let-7a-1]
## 93 miRBase transcript name ID(s) [e.g. hsa-mir-1253.1-201]
gensInfo_micro = getBM(c("mirbase_id","ensembl_gene_id","chromosome_name", "start_position","end_position", "entrezgene_id","hgnc_symbol","description"), filters=c("mirbase_id", "with_mirbase"), values=list(rownames(countsF_micro), TRUE), mart=ensembl102)
gensInfo_micro$length <- gensInfo_micro$end_position - gensInfo_micro$start_position
range(gensInfo_micro$length)
## [1]  51 148
#Confirms the short length of the miRNA stem-loop sequences
dim(gensInfo_micro) #note the differing gene lengths; there are some repeated IDs and some missing values
## [1] 302   9
table(duplicated(gensInfo_micro$mirbase_id))  
## 
## FALSE  TRUE 
##   291    11
gensInfo_micro[duplicated(gensInfo_micro$mirbase_id),]
##        mirbase_id ensembl_gene_id chromosome_name start_position end_position
## 25   hsa-mir-1229 ENSG00000221394               5      179798278    179798346
## 79   hsa-mir-181c ENSG00000207613              19       13874699     13874808
## 81   hsa-mir-181d ENSG00000207585              19       13874875     13875011
## 100  hsa-mir-1976 ENSG00000238705               1       26554542     26554593
## 129   hsa-mir-23a ENSG00000207980              19       13836587     13836659
## 133  hsa-mir-24-2 ENSG00000284387              19       13836287     13836359
## 138   hsa-mir-27a ENSG00000207808              19       13836440     13836517
## 206   hsa-mir-423 ENSG00000283935              17       30117079     30117172
## 242 hsa-mir-509-2 ENSG00000208000               X      147260532    147260625
## 262   hsa-mir-598 ENSG00000207600               8       11035206     11035302
## 277   hsa-mir-675 ENSG00000284010              11        1996759      1996831
##     entrezgene_id hgnc_symbol
## 25      100302156     MIR1229
## 79         406957     MIR181C
## 81         574457     MIR181D
## 100     100302190     MIR1976
## 129        407010      MIR23A
## 133        407013     MIR24-2
## 138        407018      MIR27A
## 206        494335      MIR423
## 242        574514    MIR509-1
## 262        693183      MIR598
## 277     100033819      MIR675
##                                            description length
## 25   microRNA 1229 [Source:HGNC Symbol;Acc:HGNC:33924]     68
## 79   microRNA 181c [Source:HGNC Symbol;Acc:HGNC:31552]    109
## 81   microRNA 181d [Source:HGNC Symbol;Acc:HGNC:32089]    136
## 100  microRNA 1976 [Source:HGNC Symbol;Acc:HGNC:37064]     51
## 129   microRNA 23a [Source:HGNC Symbol;Acc:HGNC:31605]     72
## 133  microRNA 24-2 [Source:HGNC Symbol;Acc:HGNC:31608]     72
## 138   microRNA 27a [Source:HGNC Symbol;Acc:HGNC:31613]     77
## 206   microRNA 423 [Source:HGNC Symbol;Acc:HGNC:31880]     93
## 242 microRNA 509-1 [Source:HGNC Symbol;Acc:HGNC:32146]     93
## 262   microRNA 598 [Source:HGNC Symbol;Acc:HGNC:32854]     96
## 277   microRNA 675 [Source:HGNC Symbol;Acc:HGNC:33351]     72
#11 duplicates need to be removed

length(setdiff(rownames(countsF_micro), gensInfo_micro$mirbase_id)) 
## [1] 24
countsFDF_micro <- data.frame(ID=rownames(countsF_micro),countsF_micro)
countsFInfo_micro <- right_join(countsFDF_micro, gensInfo_micro, by=c("ID"="mirbase_id")) 
countsFInfo_micro <- countsFInfo_micro[!duplicated(countsFInfo_micro$ID),] #After having checked duplications, just keep first result
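
#A minimal base-R sketch of the join-then-deduplicate step above, assuming hypothetical tables toy_counts and toy_annot.
#merge(..., all.y = TRUE) is used here as a stand-in for dplyr's right_join; it is an analogue, not the call used in the analysis.

```r
# Hypothetical counts table and annotation table
toy_counts <- data.frame(ID = c("mir-a", "mir-b", "mir-c"),
                         s1 = c(10, 20, 30))
toy_annot  <- data.frame(mirbase_id = c("mir-a", "mir-a", "mir-b"),
                         ensembl_gene_id = c("E1", "E2", "E3"),
                         length = c(70, 71, 90))

# Keep every annotation row (right-join behaviour); mir-c has no
# annotation and is therefore dropped
toy_joined <- merge(toy_counts, toy_annot,
                    by.x = "ID", by.y = "mirbase_id", all.y = TRUE)

# mir-a maps to two Ensembl genes; keep only its first row
toy_dedup <- toy_joined[!duplicated(toy_joined$ID), ]
toy_dedup$ID
```

#Keeping the first match is a simplification: which Ensembl record survives depends on row order, so the retained coordinates should be checked for the duplicated IDs.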

#To perform FPKM (for paired-end reads) or RPKM (for single-end reads), we first divide by the library size and then by gene length. 
#Notice that the sum of each sample after RPKM normalization is different. We assume that only single-end sequencing was performed for the short miRNA reads.
#step 1: normalize for read depth and multiply by a million
readD_micro <- apply(countsFInfo_micro[,2:11], 2, function(x) x / sum(x) * 10^6) 

#step 2. scale by gene length and multiply by thousand
countsRPKM_micro <- readD_micro / countsFInfo_micro$length * 10^3
colSums(countsRPKM_micro)
## TCGA.OR.A5J9.01A.11R.A29W.13 TCGA.OR.A5JE.01A.11R.A29W.13 
##                     12246757                     11596788 
## TCGA.OR.A5JF.01A.11R.A29W.13 TCGA.OR.A5JI.01A.11R.A29W.13 
##                     12009815                     12506499 
## TCGA.OR.A5K0.01A.11R.A29W.13 TCGA.OR.A5KV.01A.11R.A29W.13 
##                     11784215                     11737401 
## TCGA.OR.A5L5.01A.11R.A29W.13 TCGA.OR.A5LC.01A.11R.A29W.13 
##                     12113583                     11190660 
## TCGA.OR.A5LE.01A.11R.A29W.13 TCGA.OR.A5LL.01A.11R.A29W.13 
##                     11303870                     11656854
#To perform TPM, we first divide by the gene length and then divide by the length-scaled sequencing depth. 
#Check that the sum of each column after TPM normalization equals 10^6.
sampleTF_micro <- colSums(countsFInfo_micro[,2:11]) 

#step 1: divide by gene length and multiply by thousand to obtain the reads per kilobase (RPK) 
rpk_micro <- countsFInfo_micro[,2:11] / countsFInfo_micro$length * 10^3

#step 2: divide by sequencing depth and multiply by million
countsTPM_micro <- apply(rpk_micro, 2, function(x) x / sum(x) * 10^6)

#check totals (All equal to 1 million)
colSums(countsTPM_micro)
## TCGA.OR.A5J9.01A.11R.A29W.13 TCGA.OR.A5JE.01A.11R.A29W.13 
##                        1e+06                        1e+06 
## TCGA.OR.A5JF.01A.11R.A29W.13 TCGA.OR.A5JI.01A.11R.A29W.13 
##                        1e+06                        1e+06 
## TCGA.OR.A5K0.01A.11R.A29W.13 TCGA.OR.A5KV.01A.11R.A29W.13 
##                        1e+06                        1e+06 
## TCGA.OR.A5L5.01A.11R.A29W.13 TCGA.OR.A5LC.01A.11R.A29W.13 
##                        1e+06                        1e+06 
## TCGA.OR.A5LE.01A.11R.A29W.13 TCGA.OR.A5LL.01A.11R.A29W.13 
##                        1e+06                        1e+06
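
#A minimal sketch contrasting the two normalizations above on hypothetical counts and lengths (toy_counts, toy_len and the derived objects are illustrative names).
#It shows why RPKM column sums differ between samples while TPM column sums are always one million.

```r
# Hypothetical counts (3 miRNAs x 3 samples) and feature lengths in nt
toy_counts <- matrix(c(10, 20, 30,
                       40, 10, 50,
                       50, 70, 20),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("mirA", "mirB", "mirC"),
                                     c("s1", "s2", "s3")))
toy_len <- c(mirA = 50, mirB = 100, mirC = 150)

# RPKM: scale by library size first, then by feature length
toy_depth <- sweep(toy_counts, 2, colSums(toy_counts), "/") * 1e6
toy_rpkm  <- toy_depth / toy_len * 1e3   # length vector recycles row-wise

# TPM: scale by feature length first, then by the sum of length-scaled counts
toy_rpk <- toy_counts / toy_len * 1e3
toy_tpm <- sweep(toy_rpk, 2, colSums(toy_rpk), "/") * 1e6

colSums(toy_rpkm)   # differs between samples
colSums(toy_tpm)    # identical for every sample
```

#Because TPM values share a common total, they are easier to compare across samples than RPKM values.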
#PREPARING DATAFRAME FOR FUTURE CNV VS. miRNA-Seq VS. mRNA-Seq CORRELATION ANALYSIS AND MFA
countsF_TPM_LOG_micro<-log2(countsTPM_micro[,1:10]+2)
countsF_TPM_LOG_DF_micro<-as.data.frame(countsF_TPM_LOG_micro)
countsF_TPM_LOG_DF_micro$ID<-countsFInfo_micro$ID
countsF_TPM_LOG_DF_micro$chr<-countsFInfo_micro$chromosome_name
countsF_TPM_LOG_DF_micro$start<-countsFInfo_micro$start_position
countsF_TPM_LOG_DF_micro$end<-countsFInfo_micro$end_position
#PCA for miRNA-Seq
countsF_TPM_LOG_DF_micro_PCAMFA<-countsF_TPM_LOG_DF_micro[,1:10]
#Transpose
countsF_TPM_LOG_DF_micro_PCAMFA.t<-t(countsF_TPM_LOG_DF_micro_PCAMFA)
# Assign names; we add a micexp suffix to distinguish miRNA features from cnv or exp features
colnames(countsF_TPM_LOG_DF_micro_PCAMFA.t)<-paste(countsF_TPM_LOG_DF_micro$ID,"micexp",sep=".")
#Construct data.frame to perform PCA
miexpr4pca<-data.frame(cond2,countsF_TPM_LOG_DF_micro_PCAMFA.t)
res.pca.miexpr<-PCA(miexpr4pca,quali.sup=1)

res.pca.miexpr
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 292 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"
plot(res.pca.miexpr,habillage=1)

#With the exception of young patients A5J9 and A5JI and old patient A5LC, the young and old patient samples separate along dimensions 1 and 2,
#which together capture 28.27% + 18.86% = 47.13% of the total variance.
 

#Normalization using TMM (edgeR package)
d_micro <- DGEList(counts = countsF_micro)
Norm.Factor_micro <- calcNormFactors(d_micro, method = "TMM")
countsTMM_micro <- cpm(Norm.Factor_micro, log = T)

countsTMMnoLog_micro <- cpm(Norm.Factor_micro, log = F) 
#See how the distributions of the three normalizations (in log2) differ (for the first sample).

hist(log2(countsRPKM_micro[,1]+2), xlab="log2(RPKM+2)", main="RPKM_micro")

#The log2-transformed RPKM values appear approximately normally distributed
hist(log2(countsTPM_micro[,1]+2), xlab="log2(TPM+2)", main="TPM_micro") 

#The log2-transformed TPM values appear approximately normally distributed
hist(countsTMM_micro[,1], xlab="log2(CPM), TMM-normalized", main="TMM_micro") 

#The TMM-normalized log2-CPM values appear approximately normally distributed

#Sample aggregation
#To see how samples aggregate, we will perform hierarchical clustering as well as PCA. 
#The purpose is to see whether samples aggregate by condition or whether there are outliers, which might have biological or technical causes.

#Hierarchical clustering
x_micro<-countsTMM_micro

#Euclidean distance
clust.cor.ward_micro <- hclust(dist(t(x_micro)),method="ward.D2")
plot(clust.cor.ward_micro, main="hierarchical clustering", hang=-1,cex=0.8)

#With the exception of patient TCGA-OR-A5LC, the ward.D2 hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients

clust.cor.average_micro <- hclust(dist(t(x_micro)),method="average")
plot(clust.cor.average_micro, main="hierarchical clustering", hang=-1,cex=0.8)

#The average-linkage hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients

clust.cor.complete_micro <- hclust(dist(t(x_micro)),method="complete") #named for the complete linkage actually used
plot(clust.cor.complete_micro, main="hierarchical clustering", hang=-1,cex=0.8)

#The complete-linkage hierarchical clustering appears to partially reflect the segregation of the 5 old and 5 young patients

#Correlation-based distance
clust.cor.ward_micro <- hclust(as.dist(1-cor(x_micro)),method="ward.D2")
plot(clust.cor.ward_micro, main="hierarchical clustering", hang=-1,cex=0.8)

#The ward.D2 hierarchical clustering appears to reflect the segregation of the 5 old and 5 young patients

clust.cor.average_micro<- hclust(as.dist(1-cor(x_micro)),method="average")
plot(clust.cor.average_micro, main="hierarchical clustering", hang=-1,cex=0.8) 

#The average-linkage hierarchical clustering does not appear to reflect the segregation of the 5 old and 5 young patients
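
#A minimal sketch of the two distance choices compared above, run on simulated profiles (toy_expr, hc_euc and hc_cor are hypothetical names).
#Two groups of three samples are generated around two independent base profiles, so both distances should recover the groups here; with real data they can disagree, as seen above.

```r
# Simulated expression: 50 features x 6 samples, two groups (a, b)
set.seed(42)
base_a <- rnorm(50)
base_b <- rnorm(50)
toy_expr <- cbind(a1 = base_a + rnorm(50, sd = 0.1),
                  a2 = base_a + rnorm(50, sd = 0.1),
                  a3 = base_a + rnorm(50, sd = 0.1),
                  b1 = base_b + rnorm(50, sd = 0.1),
                  b2 = base_b + rnorm(50, sd = 0.1),
                  b3 = base_b + rnorm(50, sd = 0.1))

# Euclidean distance between samples (samples must be rows for dist)
hc_euc <- hclust(dist(t(toy_expr)), method = "ward.D2")

# Correlation-based distance: 1 - Pearson correlation between samples
hc_cor <- hclust(as.dist(1 - cor(toy_expr)), method = "average")

# Cut both trees into two clusters and compare with the known groups
grp_euc <- cutree(hc_euc, k = 2)
grp_cor <- cutree(hc_cor, k = 2)
table(grp_euc, grp_cor)
```

#Euclidean distance is sensitive to overall expression magnitude, whereas correlation distance compares the shape of the profiles, which is one reason the dendrograms can differ.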

cond2<-phenoN_micro$age.status
countsF_micro_backup<-as.matrix(countsF_micro)
sum1<-sum(is.na(countsF_micro_backup))
sum1
## [1] 0


#Density plot of raw filtered read counts (log10)
countsSF_micro_backup_log <- log10(countsF_micro_backup) #zero counts become -Inf
plot(density(countsSF_micro_backup_log),xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw filtered read counts per gene (log10 transformation)", ylab="Density")
#Overlay one density curve per sample
for (s in seq_len(ncol(countsSF_micro_backup_log))){
  lines(density(countsSF_micro_backup_log[,s]))
}

#Box plots of raw filtered read counts after log10 transformation
countsSF_micro_backup_log <- log(countsF_micro_backup,10)
boxplot(countsSF_micro_backup_log , main="", xlab="", ylab="Raw read counts per gene (log10)",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 4 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 9 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 10 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(countsSF_micro_backup_log))),labels=colnames(countsSF_micro_backup_log),las=2,cex.axis=0.8)

#Heatmap with condition age.status as labels
colnames(countsF_micro_backup)<-phenoN_micro$age.status 
#Plot heatmap
heatmap(countsF_micro_backup, col = topo.colors(50), margin=c(10,6))

#Evidently one young patient is overexpressing many miRNA genes

#PCA via classical multidimensional scaling (MDS)
#Transpose the data to have variables (genes) as columns
data_for_PCA2 <- t(countsF_micro_backup)
 
#dist() calculates a matrix of Euclidean dissimilarities between samples from the transposed data;
#cmdscale() then performs classical MDS on it and, with eig=TRUE, returns the eigenvalues used to derive the proportion of explained variance.
#Calculate MDS from the matrix of dissimilarities
mds2 <- cmdscale(dist(data_for_PCA2), k=3, eig=TRUE)  

mds2$eig
##  [1]  5.986174e+12  3.577930e+12  1.628514e+12  1.230555e+12  4.871337e+11
##  [6]  4.195289e+11  1.086244e+11  3.114651e+10  9.893546e+09 -5.227671e-04
#Plotting this variable as a percentage will help determine how many components can explain the variability
#in your dataset and thus how many dimensions you should be looking at.

#Transform the eigenvalues into percentages
eig_pc2 <- mds2$eig * 100 / sum(mds2$eig)
#Scree plot of the explained variance per dimension
barplot(eig_pc2,las=1,xlab="Dimensions", ylab="Proportion of explained variance (%)",col="darkgrey")

#In most cases, the first 2 components explain more than half the variability in the dataset and can be used for plotting; here they account for about 71% (from the eigenvalues above). 
#cmdscale performs classical MDS, which on Euclidean distances is equivalent to a principal component analysis of the data matrix, 
#and the plot function provides a scatter plot representing the individuals.
 
#Calculate MDS
mds2 <- cmdscale(dist(data_for_PCA2)) # Performs MDS analysis with the default k=2 dimensions
#Samples representation
 
plot(mds2[,1], -mds2[,2], type="n", xlab="Dimension 1", ylab="Dimension 2", main="")
text(mds2[,1], -mds2[,2], rownames(mds2), cex=0.8) 
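
#A minimal sketch, on simulated data, of why classical MDS on Euclidean distances gives the same picture as PCA (toy, toy_mds, toy_pca and toy_pct are hypothetical names).
#The sample coordinates agree up to the sign of each axis, and the eigenvalue percentages match the PCA variance percentages.

```r
# Simulated data: 8 samples (rows) x 5 variables (columns)
set.seed(7)
toy <- matrix(rnorm(8 * 5), nrow = 8)

# Classical MDS on Euclidean distances, keeping the eigenvalues
toy_mds <- cmdscale(dist(toy), k = 2, eig = TRUE)

# PCA on the centred, unscaled data
toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Largest absolute coordinate difference, ignoring axis sign
max(abs(abs(toy_mds$points) - abs(toy_pca$x[, 1:2])))

# Percentage of variance per dimension; tiny negative eigenvalues
# caused by rounding are set to zero first
toy_eig <- pmax(toy_mds$eig, 0)
toy_pct <- toy_eig / sum(toy_eig) * 100
round(toy_pct[1:5], 2)
```

#Note that this equivalence holds for unscaled data; prcomp(..., scale=T), as used further below, standardizes each feature and therefore gives a different configuration.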

#library ggfortify needed for the autoplot to understand and plot PCA results
summary(pca.filt_micro <- prcomp(t(x_micro), scale=T )) 
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     9.1799 8.4363 6.4439 5.49967 5.13749 4.4086 4.28110
## Proportion of Variance 0.2675 0.2259 0.1318 0.09602 0.08379 0.0617 0.05818
## Cumulative Proportion  0.2675 0.4935 0.6253 0.72131 0.80510 0.8668 0.92498
##                            PC8    PC9      PC10
## Standard deviation     3.71113 3.1398 7.651e-15
## Proportion of Variance 0.04372 0.0313 0.000e+00
## Cumulative Proportion  0.96870 1.0000 1.000e+00
autoplot(pca.filt_micro, data=phenoN_micro, colour="patientID", shape="age.status")

#There does not appear to be segregation by age status
#Note that a total of 26.75% + 22.59% = 49.34% of the variance (cumulative proportion 0.4935 before rounding) 
#is accounted for by the first 2 principal components, PC1 and PC2
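
#A minimal sketch of how the proportions in the summary above are derived from the component standard deviations, using simulated data (toy, toy_pca and toy_prop are hypothetical names).
#It also illustrates why the last component has essentially zero variance when there are 10 samples: centring removes one degree of freedom.

```r
# Simulated data: 10 samples (rows) x 30 features (columns)
set.seed(3)
toy <- matrix(rnorm(10 * 30), nrow = 10)

# PCA with centring and unit-variance scaling of each feature
toy_pca <- prcomp(toy, scale. = TRUE)

# Proportion of variance = squared sdev divided by the total
toy_prop <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Variance captured by the first two components, in percent
round(100 * sum(toy_prop[1:2]), 2)

# The 10th component carries no variance (numerically close to zero)
toy_pca$sdev[10]
```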

#LIMMA-BASED Differentially Expressed miRNA genes analysis
cond2<-phenoN_micro$age.status
phenoN_micro[colnames(countsF_micro),]$age.status== cond2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#Create design matrix for limma
design2 <- model.matrix(~0+cond2)
# remove the "cond2" prefix from the design column names
colnames(design2)<- gsub("cond2","",colnames(design2))
# check design matrix
design2
##    old young
## 1    0     1
## 2    0     1
## 3    1     0
## 4    0     1
## 5    1     0
## 6    0     1
## 7    1     0
## 8    1     0
## 9    0     1
## 10   1     0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$cond2
## [1] "contr.treatment"
#calculate normalization factors between libraries
nf2 <- calcNormFactors(countsF_micro)

# normalize the read counts with 'voom' function
y2 <- voom(countsF_micro,design2,lib.size=colSums(countsF_micro)*nf2)
#Extract the Normalized read counts
counts.voom2 <- y2$E

#Fit linear model for each gene given a series of libraries
fit2 <- lmFit(y2,design2)
# construct the contrast matrix corresponding to specified contrasts of a set of parameters
cont.matrix2 <- makeContrasts(old-young,levels=design2)
cont.matrix2 
##        Contrasts
## Levels  old - young
##   old             1
##   young          -1
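
#A minimal base-R sketch of what the group-means design and the "old - young" contrast above compute for a single feature.
#It uses lm.fit as a stand-in and hypothetical values (toy_age, toy_y); it is not limma's lmFit/contrasts.fit, and it omits the voom weights and empirical Bayes moderation.

```r
# Hypothetical age groups and log-expression values for one miRNA
toy_age <- factor(c("young", "young", "old", "young", "old", "old"))
toy_y   <- c(5.1, 4.9, 7.2, 5.0, 6.8, 7.0)

# Design without intercept: one indicator column per group
toy_design <- model.matrix(~ 0 + toy_age)
colnames(toy_design) <- gsub("toy_age", "", colnames(toy_design))

# With this design the fitted coefficients are the group means
toy_means <- lm.fit(toy_design, toy_y)$coefficients

# Contrast vector: +1 for old, -1 for young
toy_contrast <- c(old = 1, young = -1)

# Estimated log fold change, old minus young
toy_logfc <- sum(toy_contrast * toy_means[names(toy_contrast)])
toy_logfc
```

#A positive value therefore means higher expression in the old group, which is how the logFC column of the limma tables should be read.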
# compute estimated coefficients and standard errors for a given set of contrasts
fit2 <- contrasts.fit(fit2, cont.matrix2)

# compute moderated t-statistics of differential expression by empirical Bayes moderation of the standard errors
fit2 <- eBayes(fit2)
options(digits=3)

# check the output fit
dim(fit2)
## [1] 315   1
#Set adjusted pvalue threshold and log fold change threshold
mypval=0.01
myfc=3

#Get the coefficient name for the comparison  of interest
colnames(fit2$coefficients)
## [1] "old - young"
mycoef="old - young"
# Get the output table for the 10 most significant DE genes for this comparison
topTable(fit2,coef=mycoef)
##                 logFC AveExpr     t P.Value adj.P.Val     B
## hsa-mir-542    -1.264   10.63 -2.97  0.0134     0.566 -4.53
## hsa-let-7e     -1.067   12.29 -2.54  0.0286     0.566 -4.53
## hsa-mir-10a     1.167   14.49  2.18  0.0533     0.566 -4.54
## hsa-mir-28      0.696   10.95  2.25  0.0469     0.566 -4.55
## hsa-let-7f-2   -0.714   14.31 -1.95  0.0788     0.566 -4.55
## hsa-mir-483    -4.348    8.07 -2.64  0.0237     0.566 -4.55
## hsa-mir-29a     0.930   11.94  2.01  0.0707     0.566 -4.55
## hsa-mir-508     3.563   11.27  2.10  0.0606     0.566 -4.56
## hsa-mir-181a-1 -1.057   11.64 -1.99  0.0731     0.566 -4.56
## hsa-mir-98     -1.352    6.43 -2.76  0.0194     0.566 -4.56
#Get the full table ("n = number of genes in the fit")
limma.res <- topTable(fit2,coef=mycoef,n=dim(fit2)[1])

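#Filter by adjusted p-value. No miRNA passes mypval (the top-ranked adj.P.Val values are all 0.566), 
#so the cutoff was relaxed to 0.57 to obtain an exploratory list; these genes are not significant after multiple-testing correction
limma.res.pval <- topTable(fit2,coef=mycoef,number=nrow(fit2),p.value=0.57)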
dim(limma.res.pval)
## [1] 69  6
#Of these, keep the genes with a large absolute log fold change (|logFC| > myfc)
limma.res.pval.FC <- limma.res.pval[which(abs(limma.res.pval$logFC)>myfc),]
dim(limma.res.pval.FC)
## [1] 19  6
limma.res.pval.FC
##                logFC AveExpr     t P.Value adj.P.Val     B
## hsa-mir-483    -4.35   8.067 -2.64  0.0237     0.566 -4.55
## hsa-mir-508     3.56  11.273  2.10  0.0606     0.566 -4.56
## hsa-mir-509-2   3.64   7.671  2.25  0.0471     0.566 -4.56
## hsa-mir-509-1   3.61   7.685  2.24  0.0476     0.566 -4.57
## hsa-mir-509-3   3.46   8.024  2.14  0.0570     0.566 -4.57
## hsa-mir-153-2  -4.57   4.647 -2.64  0.0240     0.566 -4.57
## hsa-mir-514-3   3.61   8.133  1.96  0.0766     0.566 -4.57
## hsa-mir-514-1   3.57   8.127  1.95  0.0789     0.566 -4.57
## hsa-mir-514-2   3.58   8.101  1.92  0.0832     0.566 -4.57
## hsa-mir-511-2   3.05   0.532  2.90  0.0151     0.566 -4.57
## hsa-mir-514b    3.61   2.630  2.23  0.0489     0.566 -4.58
## hsa-mir-513c    3.54   3.795  2.02  0.0693     0.566 -4.58
## hsa-mir-506     3.29   5.270  1.83  0.0960     0.566 -4.58
## hsa-mir-513a-1  4.19   1.740  2.20  0.0515     0.566 -4.58
## hsa-mir-412    -3.76   5.697 -1.71  0.1170     0.566 -4.58
## hsa-mir-153-1  -3.89   0.441 -2.16  0.0550     0.566 -4.58
## hsa-mir-507     3.01   2.743  1.83  0.0954     0.566 -4.58
## hsa-mir-329-2  -3.04   1.730 -1.90  0.0848     0.566 -4.58
## hsa-mir-513a-2  3.44   1.763  1.69  0.1211     0.566 -4.59
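
#A minimal sketch of the Benjamini-Hochberg adjustment behind the adj.P.Val column, using hypothetical p-values (toy_p) and a hand-written helper (bh_manual, not part of limma).
#It illustrates one likely reason why many rows above share the same adjusted value: the adjustment takes a running minimum, so distinct raw p-values can collapse to a single adjusted p-value.

```r
# Hypothetical raw p-values for five features
toy_p <- c(0.01, 0.02, 0.03, 0.04, 0.90)

# Hand-written BH adjustment: p * n / rank, then a running minimum
# taken from the largest p-value downwards
bh_manual <- function(p) {
  n  <- length(p)
  o  <- order(p, decreasing = TRUE)   # largest p first
  ro <- order(o)                      # index to restore input order
  pmin(1, cummin(n / (n:1) * p[o]))[ro]
}

toy_adj <- bh_manual(toy_p)
toy_adj

# Same result as the built-in implementation
all.equal(toy_adj, p.adjust(toy_p, method = "BH"))
```

#With only 5 samples per group and 315 tests, raw p-values around 0.01 to 0.1 are too weak to survive this correction.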
#Standard edgeR differential expression analysis
design <- model.matrix(~ cond2)

# Using trended dispersions
dge <- DGEList(counts = countsF_micro)
dge <- calcNormFactors(dge)
dge$samples$age.status <- cond2
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)

# Fit GLM model for the age effect (with the default coefficient, glmLRT tests the last design column)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit)

#Table of unadjusted p-values (PValue) and FDR values
p_val_DE_edgeR <- topTags(lrt, adjust.method = 'BH', n = Inf)

# Getting the top differentially expressed miRNAs
top_miRNAs <- rownames(p_val_DE_edgeR$table)[1:10]
top_miRNAs
##  [1] "hsa-mir-153-2" "hsa-mir-153-1" "hsa-mir-541"   "hsa-mir-412"  
##  [5] "hsa-mir-3200"  "hsa-mir-675"   "hsa-mir-1248"  "hsa-mir-9-2"  
##  [9] "hsa-mir-9-1"   "hsa-mir-1229"
#DESeq2 DIFFERENTIALLY EXPRESSED GENE ANALYSIS

sum_na<-sum(is.na(countsF_micro))
#DESeq2 on COUNT MATRIX:
#Filtering is also advised by DESeq2, so we will create the DESeqDataSet from the filtered counts matrix.
countsF_int_micro<-countsF_micro
object.size(countsF_int_micro)
## 49296 bytes
mode(countsF_int_micro) <- "integer"
object.size(countsF_int_micro)
## 36696 bytes
dds_micro <- DESeqDataSetFromMatrix(countData = countsF_int_micro,colData = phenoN_micro,design = ~ age.status) 
#To benefit from the default settings of the package, you should put the variable of interest at 
#the end of the formula and make sure the control level is the first level. This is not necessary if contrast option is used as here
dds_micro <- DESeq(dds_micro)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
# Global model
resG_micro <- results(dds_micro, alpha=0.05) #lfcThreshold is by default 0
summary(resG_micro)
## 
## out of 315 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 12, 3.8%
## LFC < 0 (down)     : 2, 0.63%
## outliers [1]       : 11, 3.5%
## low counts [2]     : 0, 0%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Contrasts: we check the old vs. young contrast (default alpha = 0.1)
res1_micro <- results(dds_micro, contrast=c("age.status","old","young"))
summary(res1_micro)
## 
## out of 315 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 6, 1.9%
## LFC < 0 (down)     : 18, 5.7%
## outliers [1]       : 11, 3.5%
## low counts [2]     : 0, 0%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
res1DF_micro <- as.data.frame(res1_micro)
res1DFS_micro <- res1DF_micro[order(res1DF_micro$pvalue),]
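#Keep features with a raw (unadjusted) p-value < 0.05; the padj column shows which of them remain significant after multiple-testing correction
res1DFSign_micro <- res1DFS_micro[!is.na(res1DFS_micro$pvalue) & res1DFS_micro$pvalue<0.05, ]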
res1DFSign_micro
##                baseMean log2FoldChange lfcSE  stat   pvalue    padj
## hsa-mir-153-2     789.7         -4.706 1.151 -4.09 4.32e-05 0.00827
## hsa-mir-3200       90.3         -3.007 0.745 -4.04 5.44e-05 0.00827
## hsa-mir-675       275.3          3.015 0.856  3.52 4.30e-04 0.02790
## hsa-mir-153-1      39.6         -4.825 1.394 -3.46 5.38e-04 0.02790
## hsa-mir-148b      854.5         -0.972 0.286 -3.40 6.79e-04 0.02790
## hsa-mir-9-2     43653.9         -3.558 1.067 -3.33 8.53e-04 0.02790
## hsa-mir-542      8920.8         -1.268 0.382 -3.32 9.14e-04 0.02790
## hsa-mir-541       327.6         -4.658 1.406 -3.31 9.26e-04 0.02790
## hsa-mir-9-1     43735.3         -3.541 1.070 -3.31 9.35e-04 0.02790
## hsa-mir-412      2806.0         -4.575 1.389 -3.29 9.93e-04 0.02790
## hsa-mir-1229       50.8         -3.170 0.964 -3.29 1.01e-03 0.02790
## hsa-mir-511-1      18.4          2.375 0.754  3.15 1.63e-03 0.03872
## hsa-mir-98        510.6         -1.435 0.456 -3.15 1.66e-03 0.03872
## hsa-mir-887       496.3         -2.218 0.714 -3.11 1.90e-03 0.04122
## hsa-mir-9-3        85.3         -3.593 1.190 -3.02 2.53e-03 0.05121
## hsa-mir-380       184.0         -3.739 1.270 -2.94 3.23e-03 0.06141
## hsa-mir-421        27.0         -1.422 0.489 -2.91 3.63e-03 0.06493
## hsa-mir-222        67.1          1.915 0.683  2.81 5.02e-03 0.08410
## hsa-mir-221       179.4          1.801 0.645  2.79 5.26e-03 0.08410
## hsa-mir-28      10061.0          0.723 0.264  2.74 6.10e-03 0.08902
## hsa-mir-103-2      87.5         -1.238 0.453 -2.74 6.24e-03 0.08902
## hsa-mir-598       549.9         -2.123 0.779 -2.72 6.44e-03 0.08902
## hsa-mir-1287       47.6          2.079 0.774  2.68 7.28e-03 0.09616
## hsa-mir-432      3174.8         -3.707 1.393 -2.66 7.79e-03 0.09873
## hsa-mir-324       633.5         -1.850 0.702 -2.63 8.42e-03 0.10239
## hsa-mir-200c      102.6          2.479 0.953  2.60 9.32e-03 0.10897
## hsa-let-7e      28565.1         -1.161 0.449 -2.58 9.75e-03 0.10973
## hsa-mir-329-2      58.7         -3.059 1.198 -2.55 1.07e-02 0.11602
## hsa-mir-339       177.1          0.693 0.274  2.52 1.16e-02 0.12180
## hsa-mir-135a-1     59.5          2.954 1.181  2.50 1.24e-02 0.12564
## hsa-mir-103-1  112148.6         -1.223 0.499 -2.45 1.44e-02 0.12881
## hsa-mir-16-1     1276.7          1.146 0.469  2.45 1.45e-02 0.12881
## hsa-mir-410      2400.5         -3.340 1.371 -2.44 1.48e-02 0.12881
## hsa-mir-3648       38.5         -2.465 1.015 -2.43 1.52e-02 0.12881
## hsa-mir-431      4569.2         -3.348 1.381 -2.42 1.53e-02 0.12881
## hsa-mir-141        24.7          2.439 1.010  2.42 1.57e-02 0.12881
## hsa-mir-889      4421.5         -3.116 1.296 -2.40 1.62e-02 0.12881
## hsa-mir-668        40.3         -3.202 1.333 -2.40 1.63e-02 0.12881
## hsa-mir-769       122.6         -1.054 0.440 -2.40 1.65e-02 0.12881
## hsa-mir-424      3299.2          1.242 0.528  2.35 1.87e-02 0.14194
## hsa-mir-10a    138399.9          1.213 0.526  2.31 2.11e-02 0.15675
## hsa-mir-301a      115.2         -1.684 0.752 -2.24 2.51e-02 0.18156
## hsa-mir-217       140.6          2.284 1.032  2.21 2.69e-02 0.18986
## hsa-mir-128-1     464.0         -0.717 0.327 -2.19 2.84e-02 0.19031
## hsa-mir-511-2      15.9          2.290 1.048  2.19 2.89e-02 0.19031
## hsa-mir-214        30.9          1.658 0.760  2.18 2.91e-02 0.19031
## hsa-mir-139     19962.0         -2.161 0.992 -2.18 2.94e-02 0.19031
## hsa-mir-758      1362.4         -2.636 1.216 -2.17 3.03e-02 0.19166
## hsa-mir-375       348.0          2.093 0.975  2.15 3.18e-02 0.19712
## hsa-mir-370      1355.1         -2.782 1.305 -2.13 3.30e-02 0.20073
## hsa-mir-497        75.9          1.586 0.750  2.11 3.46e-02 0.20292
## hsa-mir-33b        16.7          1.280 0.607  2.11 3.51e-02 0.20292
## hsa-mir-496       360.8         -2.537 1.209 -2.10 3.59e-02 0.20292
## hsa-mir-223       320.8          1.059 0.507  2.09 3.69e-02 0.20292
## hsa-mir-361      3451.4          1.127 0.540  2.09 3.70e-02 0.20292
## hsa-mir-425      1296.0         -1.273 0.612 -2.08 3.75e-02 0.20292
## hsa-mir-362       149.1          1.462 0.705  2.07 3.80e-02 0.20292
## hsa-mir-329-1      59.2         -2.819 1.373 -2.05 4.01e-02 0.21019
## hsa-mir-503      3750.5         -1.338 0.661 -2.03 4.28e-02 0.22043
## hsa-mir-433       353.8         -2.892 1.440 -2.01 4.47e-02 0.22635
## hsa-mir-382      2017.9         -2.516 1.271 -1.98 4.78e-02 0.23809
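#A minimal sketch, on hypothetical toy values rather than the miniACC results, of how the raw-p-value
#filter used for the table above differs from an FDR-based filter. The names toy_p, toy_res, toy_raw and
#toy_fdr are illustrative assumptions; p.adjust(method = "BH") is the Benjamini-Hochberg adjustment.
toy_p <- c(0.0001, 0.004, 0.03, 0.04, 0.2, 0.5, 0.9)
toy_res <- data.frame(pvalue = toy_p, padj = p.adjust(toy_p, method = "BH"))
toy_raw <- toy_res[toy_res$pvalue < 0.05, ] #raw p-value filter, as in the table above
toy_fdr <- toy_res[toy_res$padj < 0.05, ]   #stricter filter on the adjusted p-value
nrow(toy_raw)
nrow(toy_fdr)
#Fewer features pass the adjusted filter than the raw one, which is why padj should be reported alongside pvalue.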
#Volcano plot

colorS <- c("blue", "grey", "red")
#Significance below is defined on the raw P.Value rather than on adj.P.Val

#specific parameters
showGenes <- 20 #genes to be displayed with names

dataV <- topTable(fit2, n = Inf, coef = mycoef, adjust = "fdr")
dataV <- dataV %>% mutate(gene = rownames(dataV), logp = -(log10(P.Value)), logadjp = -(log10(adj.P.Val)),
                          FC = ifelse(logFC>0, 2^logFC, -(2^abs(logFC)))) %>%
  mutate(sig = ifelse(P.Value<0.01 & logFC > 1, "UP", ifelse(P.Value<0.01 & logFC < (-1), "DN","n.s"))) #ideally we should have an adj.P.Val < 0.05

p <- ggplot(data=dataV, aes(x=logFC, y=logp )) +
  geom_point(alpha = 1, size= 1, aes(col = sig)) + 
  scale_color_manual(values = colorS) +
  xlab(expression("log"[2]*"FC")) + ylab(expression("-log"[10]*"(p.val)")) + labs(col=" ") + 
  geom_vline(xintercept = 1, linetype= "dotted") + geom_vline(xintercept = -1, linetype= "dotted") + 
  geom_hline(yintercept = -log10(0.01), linetype= "dotted")  +  theme_bw() #dotted line matches the P.Value < 0.01 cutoff used for sig

p <- p + geom_text_repel(data = head(dataV[dataV$sig != "n.s",],showGenes), aes(label = gene)) 

print(p)
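
#A minimal sketch of the two transformations used to build the volcano plot data, applied to hypothetical
#toy values. The helper names classify_deg and signed_fc and the vectors toy_logFC and toy_pval are
#assumptions for illustration; the cutoffs (P.Value < 0.01, |logFC| > 1) are the ones used above.
classify_deg <- function(logFC, p, lfc_cut = 1, p_cut = 0.01) {
  ifelse(p < p_cut & logFC > lfc_cut, "UP",
         ifelse(p < p_cut & logFC < -lfc_cut, "DN", "n.s"))
}
#Signed linear fold change: 2^logFC for positive values, -(2^|logFC|) otherwise
signed_fc <- function(logFC) ifelse(logFC > 0, 2^logFC, -(2^abs(logFC)))
toy_logFC <- c(2, -3, 0.5, 1.5)
toy_pval <- c(0.001, 0.002, 0.0001, 0.2)
classify_deg(toy_logFC, toy_pval)
signed_fc(toy_logFC)
#A feature is labelled only when it passes both cutoffs: the third toy value fails on logFC, the fourth on the p-value.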

#Based on the first limma-based DEG model, expression of hsa-mir-511-1 and hsa-mir-675 is significantly upregulated (raw P.Value < 0.01, logFC > 1)
#as a function of the age status factor (levels young/old)

#Heatmap
#Plotting heatmap results for the limma model (without adjusting for variable patientID).

t1 <- topTable(fit2, n = Inf, coef = mycoef, adjust = "fdr")
res1 <- t1[t1$P.Value<0.01 & abs(t1$logFC) > 1,]

data.clus <- countsTMM_micro[rownames(res1),]

cond2.df <- as.data.frame(cond2)
rownames(cond2.df) <- colnames(data.clus)
pheatmap(data.clus, scale = "row", show_rownames = TRUE, annotation_col = cond2.df)
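
#A minimal sketch of what scale = "row" does to the heatmap input: each row (miRNA) is centred on its mean
#and divided by its standard deviation, so colours compare samples within a miRNA, not miRNAs with each other.
#The matrix toy_mat and its row and column names are hypothetical.
toy_mat <- matrix(c(1, 2, 3,
                    10, 20, 60), nrow = 2, byrow = TRUE,
                  dimnames = list(c("mirA", "mirB"), c("s1", "s2", "s3")))
row_z <- t(scale(t(toy_mat))) #scale() works on columns, hence the double transpose
round(row_z, 2)
#After scaling, every row has mean 0 and standard deviation 1, regardless of its original expression level.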

#The miRNA gene hsa-mir-511-1 is overexpressed in old patients A5LL and A5L5 and underexpressed in young patients
#A5LE, A5J9, and A5KV.
#In contrast, the miRNA gene hsa-mir-675 is underexpressed in young patients A5J9, A5JI, A5K0, A5JE, and A5KV,
#and overexpressed in A5LL and A5JF and, slightly, in A5LC and A5L5.

#GENE ANNOTATION AND GENE ONTOLOGY FOR DIFFERENTIALLY OVEREXPRESSED miRNA GENES
#Load the library
#The central ID for org.Hs.eg.db, a genome-wide annotation for humans based on Entrez Gene, is the NCBI Gene ID.
#org.Hs.egACCNUM is an R object that contains mappings between Entrez Gene identifiers and
#GenBank accession numbers.

# Define the list of genes of interest (DE genes; miRBase IDs at this point, converted to Entrez Gene IDs below)
mirbase_ids <- as.character(rownames(limma.res.pval.FC))
length(mirbase_ids)
## [1] 19
#We explore gene ontology for 2 selected miRNA genes that are significantly differentially expressed or show a high log fold change,
#and obtain their Entrez Gene IDs for GOstats
genes_mirbase <- c(mirbase_ids[1], rownames(dataV)[11])
 
genes_ensembl1<-countsFInfo_micro[countsFInfo_micro$ID == genes_mirbase[1],12]
genes_ensembl2<-countsFInfo_micro[countsFInfo_micro$ID == genes_mirbase[2],12]
#genes_ensembl3<-countsFInfo_micro[countsFInfo_micro$ID == "hsa-mir-511-1",12]
genes_ensembl<-c(genes_ensembl1,genes_ensembl2)
genes_ensembl
## [1] "ENSG00000207805" "ENSG00000288367"
mapIds(org.Hs.eg.db,keys = genes_ensembl,column = 'ENTREZID',keytype = 'ENSEMBL')
## 'select()' returned 1:1 mapping between keys and columns
## ENSG00000207805 ENSG00000288367 
##        "619552"     "100033819"
#The namespace is given explicitly because dplyr also exports a select() function
AnnotationDbi::select(org.Hs.eg.db,keys = genes_ensembl,columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),keytype = 'ENSEMBL')
## 'select()' returned 1:1 mapping between keys and columns
##           ENSEMBL SYMBOL  ENTREZID
## 1 ENSG00000207805 MIR483    619552
## 2 ENSG00000288367 MIR675 100033819
genes_entrez<-c("619552","100033819")

#Define the universe as all the BioMart-obtained ENTREZ GENE IDs corresponding to our non-duplicated miRNA genes
universeids <- as.character(countsFInfo_micro[,16])
length(universeids)
## [1] 291
#Before running the hypergeometric test with the hyperGTest function, we need to define the parameters
#for the test (gene lists, ontology, test direction) as well as the annotation database to be used. 
#The ontology to be tested can be any of the three GO domains: biological process (“BP”), cellular component (“CC”) or molecular function (“MF”).
#We will test for over-represented biological processes in our list of differentially expressed genes.

# define the p-value cut off for the hypergeometric test
hgCutoff <- 0.05

params <- new("GOHyperGParams",annotation="org.Hs.eg",geneIds=genes_entrez,universeGeneIds=universeids,ontology="BP",pvalueCutoff=hgCutoff,testDirection="over")
## Warning in makeValidParams(.Object): removing duplicate IDs in universeGeneIds
#Run the test
hg <- hyperGTest(params)
#Check results
hg
## Gene to GO BP  test for over-representation 
## 326 GO BP ids tested (68 have p < 0.05)
## Selected gene set size: 2 
##     Gene universe size: 257 
##     Annotation package: org.Hs.eg
#We can get the output table from the test for significant GO terms only by adjusting the pvalues with the p.adjust function.

#Get the p-values of the test
hg.pv <- pvalues(hg)
#Adjust p-values for multiple test (FDR)
hg.pv.fdr <- p.adjust(hg.pv,'fdr')
#select the GO terms with adjusted p-value less than the cut off
#sigGO.ID <- names(hg.pv.fdr[hg.pv.fdr < hgCutoff])
#select the GO terms with NON-adjusted p-value less than the cut off
sigGO.ID <- names(hg.pv[pvalues(hg) < hgCutoff])
length(sigGO.ID)
## [1] 68
#Get table from HyperG test result
df <- summary(hg)
#Keep only significant GO terms in the table
GOannot.table <- df[df[,1] %in% sigGO.ID,]
head(GOannot.table)
##       GOBPID  Pvalue OddsRatio ExpCount Count Size
## 1 GO:0010563 0.00201       Inf   0.0934     2   12
## 2 GO:0045936 0.00201       Inf   0.0934     2   12
## 3 GO:0006793 0.00638       Inf   0.1634     2   21
## 4 GO:0006796 0.00638       Inf   0.1634     2   21
## 5 GO:0019220 0.00638       Inf   0.1634     2   21
## 6 GO:0051174 0.00638       Inf   0.1634     2   21
##                                                  Term
## 1 negative regulation of phosphorus metabolic process
## 2  negative regulation of phosphate metabolic process
## 3                        phosphorus metabolic process
## 4     phosphate-containing compound metabolic process
## 5           regulation of phosphate metabolic process
## 6          regulation of phosphorus metabolic process
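#Based on unadjusted p-values and a selected set of only 2 genes, our differentially expressed miRNA genes are associated with the regulation of phosphorus metabolism;
#this enrichment should therefore be read as exploratory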
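
#A minimal sketch of the hypergeometric calculation behind the table above, using only the numbers it reports:
#a universe of 257 genes, a selected set of 2 genes, and GO terms of size 12 and 21 that contain both selected genes.
#The helper names go_pvalue and go_expected are assumptions for illustration; phyper() is the base R
#hypergeometric distribution function.
go_pvalue <- function(count, size, universe = 257, selected = 2) {
  #P(X >= count) when drawing `selected` genes from a universe holding `size` annotated genes
  phyper(count - 1, m = size, n = universe - size, k = selected, lower.tail = FALSE)
}
go_expected <- function(size, universe = 257, selected = 2) selected * size / universe
round(go_pvalue(2, 12), 5)  #compare with the Pvalue column for the terms of size 12
round(go_pvalue(2, 21), 5)  #compare with the Pvalue column for the terms of size 21
round(go_expected(12), 4)   #compare with the ExpCount column
#With only 2 selected genes, any term containing both reaches p < 0.05 once its size is small relative to the universe.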

#The R package multiMiR, with web server at http://multimir.org, is a comprehensive collection of predicted and validated miRNA-target 
#interactions and their associations with diseases and drugs.
#Retrieval of validated miRNA-target gene interactions has yielded ~11 000 target genes, suggesting that over 50% of human genes are under microRNA regulation.

vers_table <- multimir_dbInfoVersions()
vers_table
##   VERSION    UPDATED                      RDA      DBNAME
## 1   2.3.0 2020-04-15 multimir_cutoffs_2.3.rda multimir2_3
## 2   2.2.0 2017-08-08 multimir_cutoffs_2.2.rda multimir2_2
## 3   2.1.0 2016-12-22 multimir_cutoffs_2.1.rda multimir2_1
## 4   2.0.0 2015-05-01     multimir_cutoffs.rda    multimir
##                   SCHEMA PUBLIC                TABLES
## 1 multiMiR_DB_schema.sql      1 multiMiR_dbTables.txt
## 2 multiMiR_DB_schema.sql      1 multiMiR_dbTables.txt
## 3 multiMiR_DB_schema.sql      1 multiMiR_dbTables.txt
## 4 multiMiR_DB_schema.sql      1 multiMiR_dbTables.txt
curr_vers  <- vers_table[1, "VERSION"]  # current version
multimir_switchDBVersion(db_version = curr_vers)
## Now using database version: 2.3.0
#The function multimir_dbInfo() will display information about the external miRNA and miRNA-target databases in multiMiR, 
#including version, release date, link to download the data, and the corresponding table in multiMiR.
db.info = multimir_dbInfo()
db.info
##        map_name                  source_name source_version  source_date
## 1  diana_microt                 DIANA-microT              5   Sept, 2013
## 2         elmmo                        EIMMo              5    Jan, 2011
## 3     microcosm                    MicroCosm              5   Sept, 2009
## 4   mir2disease                  miR2Disease                Mar 14, 2011
## 5       miranda                      miRanda                   Aug, 2010
## 6         mirdb                        miRDB              6   June, 2019
## 7     mirecords                    miRecords              4 Apr 27, 2013
## 8    mirtarbase                   miRTarBase            7.0   Sept, 2017
## 9  pharmaco_mir Pharmaco-miR (Verified Sets)                            
## 10     phenomir                     PhenomiR              2 Feb 15, 2011
## 11       pictar                       PicTar              2 Dec 21, 2012
## 12         pita                         PITA              6 Aug 31, 2008
## 13      tarbase                      TarBase              8         2018
## 14   targetscan                   TargetScan            7.2  March, 2018
##                                                                                 source_url
## 1           http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=microT_CDS/index
## 2                                  http://www.mirz.unibas.ch/miRNAtargetPredictionBulk.php
## 3                http://www.ebi.ac.uk/enright-srv/microcosm/cgi-bin/targets/v5/download.pl
## 4                                                               http://www.mir2disease.org
## 5                                         http://www.microrna.org/microrna/getDownloads.do
## 6                                                                         http://mirdb.org
## 7                                                http://mirecords.biolead.org/download.php
## 8                                       http://mirtarbase.mbc.nctu.edu.tw/php/download.php
## 9                                       http://www.pharmaco-mir.org/home/download_VERSE_db
## 10                                             http://mips.helmholtz-muenchen.de/phenomir/
## 11                                                             http://dorina.mdc-berlin.de
## 12                                  http://genie.weizmann.ac.il/pubs/mir07/mir07_data.html
## 13 http://carolina.imis.athena-innovation.gr/diana_tools/web/index.php?r=tarbasev8%2Findex
## 14               http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61
#Among the 14 external databases, eight contain predicted miRNA-target interactions (DIANA-microT-CDS, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA, and TargetScan),
#three have experimentally validated miRNA-target interactions (miRecords, miRTarBase, and TarBase) and the remaining three contain miRNA-drug/disease associations
#(miR2Disease, Pharmaco-miR, and PhenomiR). To check these categories and databases from within R, we have a set of four helper functions, two of which are called here:
predicted_tables()
## [1] "diana_microt" "elmmo"        "microcosm"    "miranda"      "mirdb"       
## [6] "pictar"       "pita"         "targetscan"
validated_tables()
## [1] "mirecords"  "mirtarbase" "tarbase"
#get_multimir() is the main function in the package to retrieve predicted and validated miRNA-target 
#interactions and their disease and drug associations from the multiMiR database.

#Plug the miRNAs into multiMiR to get validated targets
#multimir_target_results <- get_multimir(org = 'mmu', mirna  = "hsa-mir-382", table   = 'predicted', summary = TRUE)

#Retrieving all validated gene targets of the miRNA gene hsa-mir-107 and of the miRNA genes previously found to be
#significantly differentially expressed by age.status when combining the limma, DESeq2, and edgeR approaches:

#"hsa-mir-153-2", "hsa-mir-153-1", "hsa-mir-541", "hsa-mir-412", "hsa-mir-3200",
#"hsa-mir-675", "hsa-mir-1248", "hsa-mir-9-2", "hsa-mir-9-1", "hsa-mir-1229", "hsa-mir-511-1", "hsa-mir-507", "hsa-mir-107",
#"hsa-mir-148b", "hsa-mir-542", "hsa-mir-98", "hsa-mir-887", "hsa-mir-9-3"

#countsFInfo_micro[18,1] is the ID of hsa-mir-107
example1 <- get_multimir(mirna  = countsFInfo_micro[18,1]  , summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example1@data)
##    database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirecords     MIMAT0000104     hsa-miR-107         BACE1         23621
## 2 mirecords     MIMAT0000104     hsa-miR-107        SERBP1         26135
## 3 mirecords     MIMAT0000104     hsa-miR-107          AGO1         26523
## 4 mirecords     MIMAT0000104     hsa-miR-107          AGO2         27161
## 5 mirecords     MIMAT0000104     hsa-miR-107          AGO3        192669
## 6 mirecords     MIMAT0000104     hsa-miR-107         CCNE1           898
##    target_ensembl                experiment support_type pubmed_id      type
## 1 ENSG00000186318 Luciferase activity assay               18234899 validated
## 2 ENSG00000142864                                         17637574 validated
## 3 ENSG00000092847              Western blot               20042474 validated
## 4 ENSG00000123908              Western blot               20042474 validated
## 5 ENSG00000126070              Western blot               20042474 validated
## 6 ENSG00000105173                                         19688090 validated
#rownames(limma.res.pval.FC)="hsa-mir-507"
example2 <- get_multimir(mirna  = "hsa-mir-507"  , summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example2@data)
##     database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirtarbase     MIMAT0002879     hsa-miR-507         CLOCK          9575
## 2 mirtarbase     MIMAT0002879     hsa-miR-507         MYO10          4651
## 3 mirtarbase     MIMAT0002879     hsa-miR-507         MYO10          4651
## 4 mirtarbase     MIMAT0002879     hsa-miR-507         RBM47         54502
## 5 mirtarbase     MIMAT0002879     hsa-miR-507         CAND1         55832
## 6 mirtarbase     MIMAT0002879     hsa-miR-507          POGK         57645
##    target_ensembl experiment          support_type pubmed_id      type
## 1 ENSG00000134852  HITS-CLIP Functional MTI (Weak)  23824327 validated
## 2 ENSG00000145555   PAR-CLIP Functional MTI (Weak)  22012620 validated
## 3 ENSG00000145555   PAR-CLIP Functional MTI (Weak)  21572407 validated
## 4 ENSG00000163694  HITS-CLIP Functional MTI (Weak)  23824327 validated
## 5 ENSG00000111530   PAR-CLIP Functional MTI (Weak)  24398324 validated
## 6 ENSG00000143157   PAR-CLIP Functional MTI (Weak)  20371350 validated
example3 <- get_multimir(mirna  = "hsa-mir-1248", summary = TRUE)
## Searching mirecords ...
## Searching mirtarbase ...
## Searching tarbase ...
head(example3@data)
##     database mature_mirna_acc mature_mirna_id target_symbol target_entrez
## 1 mirtarbase     MIMAT0005900    hsa-miR-1248         LMNB1          4001
## 2 mirtarbase     MIMAT0005900    hsa-miR-1248        CDKN1A          1026
## 3 mirtarbase     MIMAT0005900    hsa-miR-1248         PRRG4         79056
## 4 mirtarbase     MIMAT0005900    hsa-miR-1248           SP1          6667
## 5 mirtarbase     MIMAT0005900    hsa-miR-1248           MYC          4609
## 6 mirtarbase     MIMAT0005900    hsa-miR-1248         HMGB1          3146
##    target_ensembl experiment          support_type pubmed_id      type
## 1 ENSG00000113368  HITS-CLIP Functional MTI (Weak)  23313552 validated
## 2 ENSG00000124762   PAR-CLIP Functional MTI (Weak)  21572407 validated
## 3 ENSG00000135378  HITS-CLIP Functional MTI (Weak)  23824327 validated
## 4 ENSG00000185591  HITS-CLIP Functional MTI (Weak)  23824327 validated
## 5 ENSG00000136997       TRAP Functional MTI (Weak)  24510096 validated
## 6 ENSG00000189403  HITS-CLIP Functional MTI (Weak)  23824327 validated
#Of all miRNA genes in the DGE list, only 3 could be queried successfully with get_multimir to identify their mRNA targets.
#Among the identified targets of these 3, only CDKN1A (target of hsa-miR-1248) and SERBP1 (target of hsa-miR-107) appear distantly related, by gene symbol similarity alone,
#to the RNA-seq DGE genes CDKN2A and SERPINE1. We will therefore plot the expression levels of these miRNAs.

#As an alternative approach, we obtain targets from the Bioconductor package RmiR.Hs.miRNA through its TargetScan table, using the function miRNAGenes defined below.
#In addition, this function uses biomaRt to retrieve the HGNC symbols.
#We will use it later to obtain the targets of each differentially expressed miRNA and for the miRNA vs. mRNA correlation analysis.

#miRNA database and biomaRt connections 
dbListTables(RmiR.Hs.miRNA_dbconn())
## [1] "miranda"    "mirbase"    "mirtarget2" "pictar"     "tarbase"   
## [6] "targetscan"
#An example connecting to tarbase
#dbGetQuery(RmiR.Hs.miRNA_dbconn(),"SELECT * FROM tarbase WHERE mature_miRNA='hsa-miR-21'")
#ensembl=useMart("ensembl",dataset="hsapiens_gene_ensembl")

ensembl3 <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl") #using useEnsembl instead of useMart

miRNAGenes<-function(miRNA){
  #Returns the HGNC symbols of the TargetScan target genes of one mature miRNA
  #(an empty string if the miRNA has no TargetScan entry).
  #Only the first two columns of the targetscan table (mature_miRNA, gene_id) are needed.
  targetscan <- dbReadTable(RmiR.Hs.miRNA_dbconn(), "targetscan")[,1:2]
  gens.sel.symbol<- ""
  #RSQLite::dbGetPreparedQuery() is deprecated, so the table is read in full and subset in R instead
  g.targetscan <- targetscan[targetscan$mature_miRNA == miRNA, "gene_id"]
  if (length(g.targetscan)>0) {
    #Convert the Entrez Gene IDs of the targets into HGNC symbols through biomaRt
    gens.sel.symbol<-getBM(attributes="hgnc_symbol",filters="entrezgene_id",values=g.targetscan,mart=ensembl3)$hgnc_symbol
  }
  return(gens.sel.symbol)
}


#THIS FUNCTION WAS TESTED ON SEVERAL SETS OF SIGNIFICANT DGE miRNA GENES:
#miRNAs_test<-rownames(limma.res.pval.FC)
#miRNAs_test<-rownames(assay(mACC.mir3))
miRNAs_test<-c("hsa-miR-107")
#Note: the loop overwrites miRNA.genes_test on each iteration, so only the targets of the last miRNA are kept
for (i in miRNAs_test){
  miRNA.genes_test<-miRNAGenes(i)
}

miRNA.genes_test
##   [1] "ABCF2"      "GPC6"       "ACTR2"      "TSPAN5"     "YAF2"      
##   [6] "CDK6"       "CDK8"       "SPRY3"      "CORO2B"     "ARIH2"     
##  [11] "VAV3"       "CARM1"      "AGPAT1"     "ERLIN1"     "EXOC5"     
##  [16] "ENTREP3"    "NUP50"      "WASF3"      "TSPAN9"     "MMP24"     
##  [21] "FERMT2"     "CHD1"       "ABHD2"      "CHD2"       "CLASRP"    
##  [26] "BAZ2A"      "AKAP13"     "PDCD10"     "PRRT2"      "CHRM1"     
##  [31] "SLC2A13"    "SLITRK1"    "SLC26A7"    "MARCHF3"    "ADCYAP1"   
##  [36] "CLCN5"      "ADD2"       "ARL8A"      "SYT2"       "UBR3"      
##  [41] "DCBLD2"     "CSNK1G2"    "FAM81A"     "SYT6"       "RC3H1"     
##  [46] "CTNND1"     "FAM117B"    "BTLA"       "RNF38"      "NEK10"     
##  [51] "CREBRF"     "AMOT"       "SLC35G1"    "KANK4"      "DLG4"      
##  [56] "RCAN1"      "EBF1"       "AGO4"       "SCAMP5"     "EFNB2"     
##  [61] "CELSR2"     "EIF1AX"     "EIF4B"      "EIF5"       "CC2D1B"    
##  [66] "HACD2"      "EN2"        "ENSA"       "FAM219A"    "ZNF449"    
##  [71] "AK2"        "USF3"       "ESR1"       "ETV6"       "LRRC55"    
##  [76] "RTKN2"      "RBM24"      "ATXN7L1"    "ZNRF2"      "FGF7"      
##  [81] "COBLL1"     "RAB11FIP2"  "CPEB3"      "SLITRK3"    "FOXJ3"     
##  [86] "DKK1"       "IGSF9B"     "ZHX3"       "PEG10"      "FSTL4"     
##  [91] "TNRC6B"     "HIC2"       "GPATCH8"    "DCUN1D4"    "GGA3"      
##  [96] "SEPTIN8"    "FLOT2"      "FAF2"       "SIK2"       "PLCB1"     
## [101] "PPIP5K2"    "ZC3H7B"     "MGA"        "KLHL18"     "SATB2"     
## [106] "RPGRIP1L"   "WASHC4"     "ICE1"       "DICER1"     "ZFPM2"     
## [111] "TARDBP"     "SLC35A3"    "SUZ12"      "SH3BP4"     "BCL2L13"   
## [116] "CNOT6L"     "GABRB1"     "SCML4"      "GABRG2"     "BCLAF3"    
## [121] "SUN2"       "TMEM184B"   "RNF19A"     "ADGRA2"     "HIGD1A"    
## [126] "SPATS2L"    "UPF2"       "APPL1"      "RAI14"      "POLDIP2"   
## [131] "FBXO10"     "AGO1"       "LATS2"      "AP3M1"      "ABL2"      
## [136] "FOXP1"      "GK"         "AFF4"       "VPS4A"      "DISC1"     
## [141] "PCDH17"     "TMEM121B"   "GLUD1"      "GNAI3"      "GNS"       
## [146] "AQP11"      "ANKRD52"    "DLL1"       "FRYL"       "ANK1"      
## [151] "ANK3"       "GRIA4"      "HIPK2"      "HAPSTR1"    "TFCP2L1"   
## [156] "PACSIN1"    "BAZ2B"      "HCFC1"      "HTT"        "HLF"       
## [161] "HMGA1"      "HNRNPA2B1"  "APBA1"      "AGFG1"      "IGSF3"     
## [166] "HTR4"       "KRTAP11-1"  "KY"         "ZC3H12B"    "SYT10"     
## [171] "LANCL3"     "IHH"        "IRF2BP2"    "CCDC178"    "KCNC4"     
## [176] "MIGA1"      "RAB15"      "KIF5A"      "KIF5C"      "KPNA1"     
## [181] "KPNA3"      "KPNA4"      "TNPO1"      "CEP85L"     "C3P1"      
## [186] "CD164L2"    "PCARE"      "ARHGAP5"    "C12orf76"   "SNX30"     
## [191] "ZBTB34"     "LRP1"       "LRP2"       "ARNT"       "MAP4"      
## [196] "MBNL1"      "MECP2"      "MEF2D"      "GALNTL6"    "PALM2AKAP2"
## [201] "MTF1"       "MYBL1"      "MYH9"       "NEDD9"      "NF1"       
## [206] "NFIA"       "NFIB"       "ATP1B2"     "NKTR"       "NOTCH2"    
## [211] "NOVA1"      "NPAS2"      "NTRK2"      "FURIN"      "CD207"     
## [216] "ST8SIA3"    "PHF20"      "RASL12"     "ZDHHC3"     "WNT16"     
## [221] "PDE3B"      "PDE4D"      "UBE2J1"     "ANKFY1"     "HACD3"     
## [226] "SUFU"       "CAB39"      "CDK12"      "SIX4"       "GALNT7"    
## [231] "CDK14"      "PIK3R1"     "PI4KB"      "PITPNA"     "PLAG1"     
## [236] "BCL11A"     "CHIC1"      "LRP1B"      "UBL3"       "WNT4"      
## [241] "CCNJ"       "OTUD4"      "CNNM2"      "SNRK"       "FNBP1L"    
## [246] "ZCCHC2"     "INO80D"     "TMEM260"    "UBE2R2"     "RNF125"    
## [251] "USP47"      "BSDC1"      "ARHGAP17"   "LRRC8D"     "ARMC1"     
## [256] "RFWD3"      "PPP2R5C"    "ZNF654"     "UBE2W"      "FBXW7"     
## [261] "PPP3R1"     "PPP6C"      "ETNK1"      "CDV3"       "KIF21A"    
## [266] "PRKAB2"     "IPO9"       "DCP1A"      "PRKCE"      "FOXJ2"     
## [271] "PAG1"       "ASH1L"      "MYNN"       "PRKG1"      "GPCPD1"    
## [276] "PRMT8"      "KCMF1"      "FEM1C"      "POGLUT1"    "TULP4"     
## [281] "GOPC"       "PELI2"      "ADAMTSL3"   "PPP4R3B"    "PTH"       
## [286] "SRGAP1"     "NUFIP2"     "SEMA6A"     "TWF1"       "SIPA1L2"   
## [291] "SLAIN2"     "ADGRB3"     "SPTBN4"     "RAP2C"      "PURB"      
## [296] "NECTIN1"    "SINHCAF"    "RAN"        "PLEKHA1"    "TMEM35A"   
## [301] "BCL2L2"     "RGS4"       "TGIF2"      "BACH2"      "RPS6KA3"   
## [306] "CLIP1"      "BDNF"       "SALL1"      "SCN1A"      "SCN2A"     
## [311] "SDCBP"      "ANO3"       "DUS1L"      "BLMH"       "ATL2"      
## [316] "TENT4B"     "GREM2"      "ITSN1"      "SH3GL2"     "TNS3"      
## [321] "ST3GAL2"    "GNPNAT1"    "SOWAHC"     "WNK1"       "SLC5A3"    
## [326] "ZBTB8A"     "SLC8A2"     "SLC20A2"    "SLN"        "ZBTB10"    
## [331] "SMARCE1"    "SNCG"       "SOS1"       "SPTBN1"     "ST13"      
## [336] "VAMP1"      "TDG"        "TGFBR2"     "TGFBR3"     "TGM3"      
## [341] "THRB"       "THY1"       "TLE4"       "ACTG1"      "TPD52"     
## [346] "NR2C2"      "UMOD"       "VCL"        "VCP"        "NSD2"      
## [351] "YWHAH"      "ZNF711"     "ZKSCAN1"    "PCGF2"      "TRIM26"    
## [356] "CACNA1C"    "CACNA2D1"   "BTG2"       "CRELD1"     "ST8SIA4"   
## [361] "DCAF10"     "ATP13A3"    "PLEKHF2"    "LIN28A"     "GSTCD"     
## [366] "LONRF3"     "SYNDIG1"    "SVEP1"      "FBXL18"     "NAA15"     
## [371] "CCDC6"      "KMT2D"      "KDM7A"      "VOPP1"      "NDEL1"     
## [376] "CAMK2G"     "RAB1B"      "NRIP1"      "CAPZA2"     "AXIN2"     
## [381] "EOMES"      "FZD4"       "RASSF5"     "TMEM47"     "HSDL1"     
## [386] "USP42"      "ZNRF3"      "EVA1A"      "SYDE2"      "CHD6"      
## [391] "MAF1"       "CAMKK1"     "PCGF5"      "RAB11FIP4"  "DYRK2"     
## [396] "MAP3K21"    "PHYHIPL"    "LCOR"       "CUL4A"      "KRTAP4-4"  
## [401] "MFSD14B"    "OGT"        "PHF5A"      "TMEM25"     "RSPO3"     
## [406] "VCF1"       "AJUBA"      "CNTNAP1"    "STRIP1"     "ZC3H12C"   
## [411] "SSH2"       "CDC14A"     "RUNX1T1"    "IRS2"       "VAMP8"     
## [416] "VAMP4"      "CDC23"      "SNX3"       "RNMT"       "DCAF5"     
## [421] "CDK5R1"     "PER3"       "DDX18"      "HERC2"      "BTRC"      
## [426] "WNT3A"      "NAV1"       "NAV2"       "CCNE1"      "FCHSD1"    
## [431] "TMEM250"    "SH2D2A"     "MTMR4"      "SCAF11"     "YTHDC1"    
## [436] "DCLK1"      "DSEL"       "DLG5"       "CENPBD1P"   "ACVR2B"    
## [441] "KLF4"       "NREP"       "COPS2"      "UBE4A"      "KIF3B"     
## [446] "NMT2"       "MED26"      "PSMF1"      "KIF23"      "CLOCK"     
## [451] "CREB5"      "N4BP1"      "PPP6R2"     "TBKBP1"     "SUSD6"     
## [456] "KIAA0232"   "RIMS3"      "MAML1"      "GIT2"       "JAKMIP2"   
## [461] "C2CD5"      "TLK1"       "ZBTB39"     "NUAK1"      "G3BP2"     
## [466] "MFN2"       "JOSD1"      "HELZ"       "AMMECR1"    "CDC27"     
## [471] "SLC12A6"
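
#A minimal sketch of how the targets of several miRNAs could be kept side by side instead of being overwritten in a loop.
#toy_lookup is a hypothetical stand-in for miRNAGenes that needs no database connection; the miRNA names
#and gene symbols in toy_db are placeholders, not query results.
toy_lookup <- function(miRNA) {
  toy_db <- list("toy-miR-A" = c("GENE1", "GENE2"), "toy-miR-B" = "GENE3")
  hits <- toy_db[[miRNA]]
  if (is.null(hits)) character(0) else hits
}
toy_miRNAs <- c("toy-miR-A", "toy-miR-B", "toy-miR-C")
#lapply returns one element per miRNA; setNames labels each element with its miRNA
toy_targets <- setNames(lapply(toy_miRNAs, toy_lookup), toy_miRNAs)
lengths(toy_targets)
#Replacing toy_lookup with miRNAGenes would give one vector of HGNC symbols per miRNA, under the same assumptions as that function.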
#VISUALIZATION OF miRNA-Seq BLOCK DATA

#SUBSET LIST OF ANNOTATED miRNA GENES THAT ARE SIGNIFICANTLY DGE BETWEEN OLD AND YOUNG PATIENTS WITH CORRESPONDING GENE POSITION COORDINATES AND CHROMOSOMES:
countsFInfo_micro_sig<-countsFInfo_micro[countsFInfo_micro$ID %in% c("hsa-mir-153-2", "hsa-mir-153-1", "hsa-mir-541","hsa-mir-412","hsa-mir-3200", "hsa-mir-675","hsa-mir-1248", "hsa-mir-9-2","hsa-mir-9-1","hsa-mir-1229", "hsa-mir-511-1","hsa-mir-507","hsa-mir-107",
                                                                          "hsa-mir-148b", "hsa-mir-542", "hsa-mir-98", "hsa-mir-887", "hsa-mir-9-3"),]
countsFInfo_micro_sig<-countsFInfo_micro_sig[,c("ID", "chromosome_name", "start_position", "end_position")]
countsFInfo_micro_sig
##                ID chromosome_name start_position end_position
## 18    hsa-mir-107              10       89592747     89592827
## 24   hsa-mir-1229  CHR_HG30_PATCH      179799144    179799212
## 27   hsa-mir-1248               3      186786672    186786777
## 61   hsa-mir-148b              12       54337216     54337314
## 65  hsa-mir-153-1               2      219294111    219294200
## 66  hsa-mir-153-2               7      157574336    157574422
## 156  hsa-mir-3200              22       30731557     30731641
## 203   hsa-mir-412              14      101065447    101065537
## 238   hsa-mir-507               X      147230984    147231077
## 252   hsa-mir-541              14      101064495    101064578
## 253   hsa-mir-542               X      134541341    134541437
## 276   hsa-mir-675  CHR_HG28_PATCH        1998778      1998850
## 288   hsa-mir-887               5       15935182     15935260
## 290   hsa-mir-9-1               1      156420341    156420429
## 291   hsa-mir-9-2               5       88666853     88666939
## 292   hsa-mir-9-3              15       89368017     89368106
## 300    hsa-mir-98               X       53556223     53556341
#According to NCBI, hsa-mir-1229 and hsa-mir-675 (listed above on patch scaffolds) map to chromosomes 5 (5q35.3) and 11, respectively
#Gene hsa-mir-511-1, which is missing from the table above, is located on chromosome 10 at 17845107..17845193
#Evidently, chromosomes X and 5 carry the most significantly DGE miRNA genes (3 each)
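
#The per-chromosome tally stated above can be checked with a short sketch. The chromosome labels are copied from the printed table,
#with the two patch scaffolds replaced by the NCBI chromosomes noted in the comments; the vector name chrom is introduced here only for illustration.

```r
# Chromosome assignments copied from the table above; the two patch
# scaffolds are replaced by the chromosomes reported by NCBI.
chrom <- c("hsa-mir-107" = "10", "hsa-mir-1229" = "5", "hsa-mir-1248" = "3",
           "hsa-mir-148b" = "12", "hsa-mir-153-1" = "2", "hsa-mir-153-2" = "7",
           "hsa-mir-3200" = "22", "hsa-mir-412" = "14", "hsa-mir-507" = "X",
           "hsa-mir-541" = "14", "hsa-mir-542" = "X", "hsa-mir-675" = "11",
           "hsa-mir-887" = "5", "hsa-mir-9-1" = "1", "hsa-mir-9-2" = "5",
           "hsa-mir-9-3" = "15", "hsa-mir-98" = "X")

# Count miRNA genes per chromosome and list the most frequent first
chrom_counts <- sort(table(chrom), decreasing = TRUE)
head(chrom_counts, 3)
```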

miRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[4]]

#Already a GRanges Object (No need to unlist)
miRNA_expr.gr<-rowRanges(miRNA_expr)
 
#GVIZ VISUALIZATION OF miRNA-Seq Gene Expression for hsa-mir-107 gene on chromosome 10:
miRNA_expr.10<-miRNA_expr.gr[seqnames(miRNA_expr.gr)=='10',]
miRNA_expr.10<-keepSeqlevels(miRNA_expr.10,"10") #to remove undesired levels
exprs.10<-assays(miRNA_expr)$exprs[names(miRNA_expr.10),]
head(exprs.10)
##              TCGA-OR-A5J9-01A-11R-A29W-13 TCGA-OR-A5JE-01A-11R-A29W-13
## hsa-mir-107                           486                          238
## hsa-mir-1287                           14                            4
## hsa-mir-1296                           20                           17
## hsa-mir-1307                        17146                        15253
## hsa-mir-146b                          164                         2008
## hsa-mir-202                         16535                         9335
##              TCGA-OR-A5JF-01A-11R-A29W-13 TCGA-OR-A5JI-01A-11R-A29W-13
## hsa-mir-107                           376                          241
## hsa-mir-1287                           33                           37
## hsa-mir-1296                           16                            8
## hsa-mir-1307                         5148                         6484
## hsa-mir-146b                          497                         3543
## hsa-mir-202                         14761                         2724
##              TCGA-OR-A5K0-01A-11R-A29W-13 TCGA-OR-A5KV-01A-11R-A29W-13
## hsa-mir-107                           346                           88
## hsa-mir-1287                           38                           12
## hsa-mir-1296                           13                            5
## hsa-mir-1307                        10990                         9169
## hsa-mir-146b                          706                          324
## hsa-mir-202                         11136                        13359
##              TCGA-OR-A5L5-01A-11R-A29W-13 TCGA-OR-A5LC-01A-11R-A29W-13
## hsa-mir-107                            77                          287
## hsa-mir-1287                          189                           26
## hsa-mir-1296                            7                           24
## hsa-mir-1307                         3501                        17001
## hsa-mir-146b                         1254                         1124
## hsa-mir-202                          2461                         9924
##              TCGA-OR-A5LE-01A-11R-A29W-13 TCGA-OR-A5LL-01A-11R-A29W-13
## hsa-mir-107                           275                          270
## hsa-mir-1287                           46                           48
## hsa-mir-1296                           31                            3
## hsa-mir-1307                        28446                         9177
## hsa-mir-146b                          602                         1367
## hsa-mir-202                          7464                        21922
chr <- "chr10"
geno <- "hg19"
atrack <- AnnotationTrack(miRNA_expr.10, name = "miRNA-Seq for Gene hsa-mir-107")
gtrack <- GenomeAxisTrack() 
itrack <- IdeogramTrack(gen = geno, chromosome = chr) 

#We set from and to in plotTracks() to delimit the plotted region
dtrack <- DataTrack(data = t(exprs.10), start=start(miRNA_expr.10), end=end(miRNA_expr.10),chromosome = chr, genome = geno,name = "miRNA-Seq for Gene hsa-mir-107")
plotTracks(list(gtrack, atrack, itrack, dtrack), from=89590000, to=89600000, type="heatmap", col="blue") #data track drawn as a heatmap

#CIRCOS VISUALIZATION:
options(stringsAsFactors = FALSE)  
rr.df_micro<-as.data.frame(rowRanges(miRNA_expr))
rna_micro<-assays(miRNA_expr)$"exprs"
 

#Filtering
SD_micro <-apply(rna_micro,1,sd)
cbind(quantiles <-quantile(SD_micro, probs = seq(0, 1, 0.01)))
##          [,1]
## 0%   0.00e+00
## 1%   0.00e+00
## 2%   3.16e-01
## 3%   5.08e-01
## 4%   8.47e-01
## 5%   1.25e+00
## 6%   1.39e+00
## 7%   1.51e+00
## 8%   1.76e+00
## 9%   2.02e+00
## 10%  2.32e+00
## 11%  2.60e+00
## 12%  2.83e+00
## 13%  2.97e+00
## 14%  3.16e+00
## 15%  3.48e+00
## 16%  4.03e+00
## 17%  4.38e+00
## 18%  4.70e+00
## 19%  5.20e+00
## 20%  5.68e+00
## 21%  6.11e+00
## 22%  7.16e+00
## 23%  8.04e+00
## 24%  8.88e+00
## 25%  9.89e+00
## 26%  1.05e+01
## 27%  1.12e+01
## 28%  1.20e+01
## 29%  1.31e+01
## 30%  1.45e+01
## 31%  1.53e+01
## 32%  1.77e+01
## 33%  1.87e+01
## 34%  2.26e+01
## 35%  2.43e+01
## 36%  2.79e+01
## 37%  3.14e+01
## 38%  3.71e+01
## 39%  4.18e+01
## 40%  4.45e+01
## 41%  4.75e+01
## 42%  4.92e+01
## 43%  5.12e+01
## 44%  5.55e+01
## 45%  6.57e+01
## 46%  7.40e+01
## 47%  8.24e+01
## 48%  9.19e+01
## 49%  9.88e+01
## 50%  1.08e+02
## 51%  1.20e+02
## 52%  1.30e+02
## 53%  1.40e+02
## 54%  1.46e+02
## 55%  1.61e+02
## 56%  1.72e+02
## 57%  1.92e+02
## 58%  2.13e+02
## 59%  2.26e+02
## 60%  2.38e+02
## 61%  2.52e+02
## 62%  2.69e+02
## 63%  2.94e+02
## 64%  3.22e+02
## 65%  3.59e+02
## 66%  3.93e+02
## 67%  4.00e+02
## 68%  4.58e+02
## 69%  5.48e+02
## 70%  5.79e+02
## 71%  6.22e+02
## 72%  6.77e+02
## 73%  7.40e+02
## 74%  8.16e+02
## 75%  9.34e+02
## 76%  1.00e+03
## 77%  1.04e+03
## 78%  1.23e+03
## 79%  1.47e+03
## 80%  1.75e+03
## 81%  1.91e+03
## 82%  2.21e+03
## 83%  2.36e+03
## 84%  2.87e+03
## 85%  3.42e+03
## 86%  3.86e+03
## 87%  4.96e+03
## 88%  5.57e+03
## 89%  6.00e+03
## 90%  7.06e+03
## 91%  7.98e+03
## 92%  8.93e+03
## 93%  1.20e+04
## 94%  1.80e+04
## 95%  2.32e+04
## 96%  3.10e+04
## 97%  5.54e+04
## 98%  1.07e+05
## 99%  3.09e+05
## 100% 7.59e+05
rna.f_micro<-rna_micro[SD_micro>quantiles["98%"],]
rr.df.f_micro<-rr.df_micro[rownames(rna.f_micro),]
T.rr_micro<-data.frame("chr"=rr.df.f_micro$seqnames,"Start"=as.integer(rr.df.f_micro$start),"End"=as.integer(rr.df.f_micro$end),rna.f_micro,row.names=NULL)
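
#The variability filter used above (keep features whose standard deviation exceeds the 98th percentile) can be sketched on toy data.
#The matrix toy and all names below are hypothetical and only illustrate the mechanics, not the miniACC values.

```r
set.seed(1)
# Toy count matrix: 100 features x 6 samples, with four expression levels
lambda <- rep(c(5, 50, 500, 5000), each = 25)          # one mean per feature (row)
toy <- matrix(rpois(600, lambda = rep(lambda, times = 6)), nrow = 100,
              dimnames = list(paste0("feat", 1:100), paste0("s", 1:6)))

# Per-feature standard deviation and the 98th-percentile cutoff
sd_toy <- apply(toy, 1, sd)
cutoff <- quantile(sd_toy, probs = 0.98)

# Keep only the most variable features (drop = FALSE preserves the matrix shape)
toy_f <- toy[sd_toy > cutoff, , drop = FALSE]
dim(toy_f)
```

#Because the standard deviation of raw counts grows with the mean, filtering on the raw scale tends to retain highly expressed features;
#whether to filter before or after a log transformation is a design choice.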
par(mar=c(2, 2, 2, 2));


plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=380, cir="hg19", W=4,   type="chr", print.chr.lab=T, scale=T);
circos(R=320, cir="hg19", W=50,  mapping=T.rr_micro,   col.v=4,    type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");

#Check the scale and consider transforming it

range(rna.f_micro)
## [1]     205 2753979
#Perform a log transformation with an offset (as log(0) -> -Inf)
T.rr_micro<-data.frame("chr"=rr.df.f_micro$seqnames,"Start"=as.integer(rr.df.f_micro$start),"End"=as.integer(rr.df.f_micro$end),log2(rna.f_micro+1),row.names=NULL)
par(mar=c(2, 2, 2, 2));
plot(c(1,800), c(1,800), type="n", axes=F, xlab="", ylab="", main="");
circos(R=400, cir="hg19", W=4,   type="chr", print.chr.lab=T, scale=T);
circos(R=340, cir="hg19", W=50,  mapping=T.rr_micro,   col.v=4,    type="heatmap2",B=FALSE, cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue");

GISTIC CNV DATA BLOCK ANALYSIS

#Preliminary analysis of individual extracted CNV Summarized Experiment:

#The following text summary is quoted from:
#https://bioconductor.org/packages/devel/bioc/vignettes/CNVRanger/inst/doc/CNVRanger.html
#Title: Summarization and quantitative trait analysis of CNV ranges
#Authors: Vinicius Henrique da Silva and Ludwig Geistlinger
  
#Copy number variation (CNV) is a frequently observed deviation from the diploid state due to duplication or deletion of genomic regions.
#Size definitions vary by source: CNVs are commonly defined as duplications or deletions of DNA segments larger than 1 kb,
#while other sources describe them as structural variations ranging in length between 50 bp and 1 Mbp.
#The single nucleotide polymorphism (SNP), on the other hand, is a single nucleotide change or a point mutation that is found in more than 1% of the population.
#Both CNVs and SNPs are immensely valuable in genetic screening studies and kinship analysis. There are five forms of CNVs.
#The first is called a deletion. A loss of a DNA segment can reduce the copy number of a gene or a group of genes.
#The second is called tandem duplication. Here, a copy of a chromosomal segment is inserted into an adjacent region. 
#The third is called noncontiguous duplication. Here, a chromosomal segment duplicates and inserts into a distant chromosomal region or a different chromosome. 
#The fourth form is called Multiallelic CNV. A segment of DNA duplicates several times and results in the formation of multiple alleles of a gene. 
#The fifth form is called complex rearrangement.

#CNVs are widespread among humans - on an average 12 CNVs exist per individual in comparison to the reference genome. 
#They have also been shown to play a role in diseases such as autism, breast cancer, obesity, Alzheimer’s disease and schizophrenia among other diseases.
#Germ line versus somatic CNV: germ line CNVs are relatively short (a few bp to a few Mbp) copy number changes that the individual inherits from one of the two
#parental gametes and thus are typically present in 100% of cells.
#Somatic CNVs (often called CNAs, where A stands for alteration or aberration) are copy number changes of any size and amount (from a few bases to whole chromosomes)
#that happen (and often carry on happening) in cancer cells. Cancer cells can be aneuploid (that is, largely triploid, tetraploid or even haploid)
#and can have high focal amplifications (some regions could have many copies: it is not unusual to have 8-12 copies for some regions).
#Furthermore, because tumor samples are typically an admixture of normal and cancer cells, the tumor purity is unknown and variable.

#Different algorithms make different assumptions when handling somatic or germ line CNVs. Typically, a germ line CNV caller can assume:
#The genome is largely diploid.
#The sample is pure and homogeneous.
#Any gain or loss should show 50% more or 50% less coverage.
#For these reasons, the algorithms can focus more on associating p-values with each call; it is possible to estimate false positive and false negative rates.
#Somatic CNA callers cannot make any of the assumptions above, or if they do, they have limited scope.
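
#The coverage arithmetic behind these assumptions can be made concrete with a small sketch. It assumes a diploid baseline and a simple mixture
#in which a fraction purity of cells carries the copy number change while the remaining cells are diploid; the helper expected_log2_ratio is hypothetical.

```r
# Expected log2 coverage ratio for a region with integer copy number `cn`,
# assuming a diploid baseline and that only a fraction `purity` of the
# cells carry the change (the rest contribute 2 copies).
expected_log2_ratio <- function(cn, purity = 1) {
  log2((purity * cn + (1 - purity) * 2) / 2)
}

expected_log2_ratio(3)        # pure sample, 1-copy gain: log2(1.5), i.e. 50% more coverage
expected_log2_ratio(1)        # pure sample, 1-copy loss: log2(0.5), i.e. 50% less coverage
expected_log2_ratio(3, 0.5)   # same gain at 50% purity: log2(1.25), a weaker signal
```

#Under these assumptions, a 1-copy gain in a half-pure sample yields only 25% more coverage, which illustrates why unknown purity complicates somatic calling.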

#CNVs can be experimentally detected based on comparative genomic hybridization, and computationally inferred from SNP-arrays or next-generation sequencing data. 
#These technologies for CNV detection have in common that they report, for each sample under study, genomic regions that are duplicated or deleted with respect to a reference.
#Such regions are denoted as CNV calls in the following and will be considered the starting point for analysis.
#The CNVRanger package imports CNV calls from a simple file format into R, and stores them in dedicated Bioconductor data structures, 
#and implements three frequently used approaches for summarizing CNV calls across a population: 
#(i) the CNVRuler procedure that trims region margins based on regional density (Kim et al., 2012),
#(ii) the reciprocal overlap procedure that requires sufficient mutual overlap between calls (Conrad et al., 2010), and
#(iii) the GISTIC procedure that identifies recurrent CNV regions (Beroukhim et al., 2007).
#CNVRanger builds on regioneR for overlap analysis of CNVs with functional genomic regions, and implements RNA-seq expression Quantitative Trait Loci (eQTL) analysis
#for CNVs by interfacing with edgeR.

#CNVRanger reads CNV calls from a simple file format, providing at least chromosome, start position, end position, sample ID, and integer copy number for each call.
#The last column contains the integer copy number state for each call, encoded as

#0: homozygous deletion (2-copy loss)
#1: heterozygous deletion (1-copy loss)
#2: normal diploid state
#3: 1-copy gain
#4: amplification (>= 2-copy gain)

#For CNV detection software that uses a different encoding, it is necessary to convert to the above encoding. For example, the GISTIC2 procedure that was used to
#generate our SummarizedExperiment CNV block uses the following format, which can be converted by simply adding 2:

#-2: homozygous deletion (2-copy loss)
#-1: heterozygous deletion (1-copy loss)
#0: normal diploid state
#1: 1-copy gain
#2: amplification (>= 2-copy gain)
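
#A minimal sketch of this conversion on a toy matrix follows. The values and the names gistic_toy, cn_state and cn_label are hypothetical
#and are not taken from the miniACC data.

```r
# Toy gene x sample matrix of GISTIC2 states in the -2..2 encoding
gistic_toy <- matrix(c(-2, -1, 0, 1, 2, 0), nrow = 2,
                     dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Shift -2..2 onto the 0..4 integer copy number encoding
cn_state <- gistic_toy + 2

# Attach a readable label to each state (index = state + 1)
state_labels <- c("homozygous deletion", "heterozygous deletion",
                  "normal diploid", "1-copy gain", "amplification")
cn_label <- matrix(state_labels[cn_state + 1], nrow = nrow(cn_state),
                   dimnames = dimnames(cn_state))
cn_label
```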

#In CNV analysis, it is often of interest to summarize individual calls across the population (i.e. to define CNV regions) for subsequent association analysis with expression
#and phenotype data. In the simplest case, this just merges overlapping individual calls into summarized regions. However, this typically inflates CNV region size,
#and more appropriate approaches have been developed for this purpose. There is a need for quality control of CNV calls and appropriate accounting for sources of technical bias
#before applying these summarization functions (or, in general, downstream analysis with CNVRanger). For instance, protocols for read-depth CNV calling typically exclude calls
#overlapping defined repetitive and low-complexity regions, including the UCSC list of segmental duplications (Trost et al., 2018; Zhou et al., 2018). We also note that CNVnator,
#a very popular read-depth CNV caller, implements the q0-filter to explicitly flag and, if desired, exclude calls that are likely to stem from such regions.
#If systematically over-represented in the input CNV calls, summarization procedures such as GISTIC will identify these regions as recurrent independent of whether there
#are biological or technical reasons for that. In cancer in particular, it is important to distinguish driver from passenger mutations, i.e. to distinguish meaningful events from random background aberrations.
#The GISTIC method identifies those regions of the genome that are aberrant more often than would be expected by chance, with greater weight given to high amplitude events
#(high-level copy-number gains or homozygous deletions) that are less likely to represent random aberrations.
#GISTIC is a tool to identify genes targeted by somatic copy number variation (CNV). The GISTIC algorithm defines CNV boundaries by a user-defined confidence level.

#Module Name:   GISTIC2
#Description:   Genomic Identification of Significant Targets in Cancer, version 2.0
#Authors:   Gad Getz, Rameen Beroukhim, Craig Mermel, Steve Schumacher and Jen Dobson
#Date:  27 Mar 2017
#Release:   2.0.23
#Software interface: Command-line user interface
#Language: Matlab
#Operating system: Linux

#The GISTIC module identifies regions of the genome that are significantly amplified or deleted across a set of samples. 
#Each aberration is assigned a G-score that considers the amplitude of the aberration as well as the frequency of its occurrence across samples. 
#False Discovery Rate q-values are then calculated for the aberrant regions, and regions with q-values below a user-defined threshold are considered significant. 
#For each significant region, a "peak region" is identified, which is the part of the aberrant region with greatest amplitude and frequency of alteration. 
#In addition, a "wide peak" is determined using a leave-one-out algorithm to allow for errors in the boundaries in a single sample. 
#The "wide peak" boundaries are more robust for identifying the most likely gene targets in the region. Each significantly aberrant region is also tested to 
#determine whether it results primarily from broad events (longer than half a chromosome arm), focal events, or significant levels of both. 
#The GISTIC module reports the genomic locations and calculated q-values for the aberrant regions. It identifies the samples that exhibit each significant 
#amplification or deletion, and it lists genes found in each "wide peak" region.
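
#To illustrate how a score can combine amplitude and frequency, the sketch below multiplies the fraction of altered samples by their mean amplitude.
#This is a deliberately simplified toy, with a hypothetical matrix and an assumed gain threshold; the scoring and the permutation-based
#significance testing of the actual GISTIC2 module are more involved.

```r
# Toy log2 copy-ratio matrix: 3 regions x 5 samples (hypothetical values)
cn_toy <- rbind(regionA = c(0.9, 1.1, 0.0, 1.0, 0.0),
                regionB = c(0.3, 0.0, 0.0, 0.0, 0.0),
                regionC = c(0.2, 0.3, 0.2, 0.3, 0.2))

amp_threshold <- 0.1                      # assumed cutoff for calling a gain
is_amp <- cn_toy > amp_threshold          # which samples show a gain per region

freq     <- rowMeans(is_amp)                                     # frequency of the gain
mean_amp <- rowSums(cn_toy * is_amp) / pmax(rowSums(is_amp), 1)  # mean amplitude among altered samples
g_score  <- freq * mean_amp                                      # simplified amplitude-by-frequency score

sort(g_score, decreasing = TRUE)
```

#In this toy, a frequent high-amplitude gain (regionA) outranks both a rare gain (regionB) and a ubiquitous but weak one (regionC).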

#According to https://www.bioconductor.org/packages/release/bioc/vignettes/MultiAssayExperiment/inst/doc/QuickStartMultiAssay.html,
#the assay matrix of our non-ranged SummarizedExperiment (gistict: SummarizedExperiment with 198 rows and 43 columns)
#obtained from the miniACC MultiAssayExperiment represents the GISTIC genomic copy number by gene. This is apparently a summary of the filtered,
#statistically significant gene-based recurrent copy number lesions identified via the aforementioned GISTIC2 procedure.

#DIFFERENTIAL CNV gistic peaks ACROSS YOUNG AND OLD PATIENTS:
#Exploring the SummarizedExperiment extracted from the initial miniACC MultiAssayExperiment:

#TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages

cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
#Alternatively:
mACC.CN3
## class: SummarizedExperiment 
## dim: 198 10 
## metadata(0):
## assays(1): ''
## rownames(198): DIRAS3 MAPK14 ... SQSTM1 KCNJ13
## rowData names(3): Gene.Symbol Locus.ID Cytoband
## colnames(10): TCGA-OR-A5J9-01A-11D-A29H-01 TCGA-OR-A5JE-01A-11D-A29H-01
##   ... TCGA-OR-A5LE-01A-11D-A29H-01 TCGA-OR-A5LL-01A-11D-A29H-01
## colData names(0):
#Creating a phenotype dataframe:
phenoN3 <- data.frame(sample=colnames(assay(mACC.CN3)),patientID=colData(miniACC.assays.comp.age)$patientID, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
rownames(phenoN3)<-phenoN3$sample 
cond2<-phenoN3$age.status
gistic.peaks <- as.matrix(assay(mACC.CN3))
sum(is.na(gistic.peaks))
## [1] 0
#As part of the exploration, we plot data
boxplot(gistic.peaks)  

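#Adding 2 shifts the GISTIC states -2..2 onto 0..4, so homozygous deletions (-2) become log2(0) = -Inf, hence the warnings below
boxplot(log2(gistic.peaks+2))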
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 2 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn

#Hierarchical clustering
x_cnv<-gistic.peaks

#Euclidean distance
clust.cor.ward <- hclust(dist(t(x_cnv)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)

#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

clust.cor.average <- hclust(dist(t(x_cnv)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8)

#The average hierarchical clustering DOES NOT appear to reflect the segregation of 5 old and 5 young patients

clust.cor.complete <- hclust(dist(t(x_cnv)),method="complete")
plot(clust.cor.complete, main="hierarchical clustering", hang=-1,cex=0.8)

#The complete hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

#Correlation based distance
clust.cor.ward <- hclust(as.dist(1-cor(x_cnv)),method="ward.D2")
plot(clust.cor.ward, main="hierarchical clustering", hang=-1,cex=0.8)

#The ward.D2 hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients

clust.cor.average<- hclust(as.dist(1-cor(x_cnv)),method="average")
plot(clust.cor.average, main="hierarchical clustering", hang=-1,cex=0.8) 

#The average hierarchical clustering appears to reflect the segregation of 5 old and 5 young patients
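
#Reading group segregation off a dendrogram is subjective; one way to check it is to cut the tree into two clusters and cross-tabulate them
#against the known groups. The sketch below does this on simulated data with an artificial group shift; the objects toy, group, hc and cl
#are hypothetical and separate from the CNV block.

```r
set.seed(42)
# Toy data: 20 features x 10 samples, two groups separated by a mean shift
group <- rep(c("old", "young"), each = 5)
toy <- matrix(rnorm(200), nrow = 20,
              dimnames = list(NULL, paste0(group, 1:5)))
toy[, group == "young"] <- toy[, group == "young"] + 3

# Cluster samples (columns) and cut the dendrogram into two clusters
hc <- hclust(dist(t(toy)), method = "ward.D2")
cl <- cutree(hc, k = 2)

# Cross-tabulate cluster membership against the known groups
tab <- table(cluster = cl, group = group)
tab
```

#A table with all samples of each group in a single, distinct cluster indicates clean segregation; off-diagonal counts indicate mixing.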

sum1<-sum(is.na(gistic.peaks))
sum1
## [1] 0
#Density plot of gistic peaks (log10)
#gistic.peaks_log <- log(gistic.peaks,10) 
#d <- density(gistic.peaks_log)
#plot(d,xlim=c(1,8),main="",ylim=c(0,.45),xlab="Raw CNV gistic peaks per gene after log10 transformation)", ylab="Density")
#for (s in 1:length(colnames(gistic.peaks_log))){
#  gistic.peaks_log <- log(gistic.peaks[,s],10) 
#  d <- density(gistic.peaks_log)
#  lines(d)
#}
#Error in density.default(gistic.peaks_log) : 'x' contains missing values

#Box plots of raw gistic peaks after log10 transformation
gistic.peaks_log <- log(gistic.peaks,10)
## Warning: NaNs produced
boxplot(gistic.peaks_log , main="", xlab="", ylab="Raw CNV gistic peaks per gene (after log10 transformation)",axes=FALSE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 5 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 6 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 7 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 8 is not drawn
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out =
## z$out[z$group == : Outlier (-Inf) in boxplot 9 is not drawn
axis(2)
axis(1,at=c(1:length(colnames(gistic.peaks_log))),labels=colnames(gistic.peaks_log),las=2,cex.axis=0.8)

#Plot Heatmap with condition age.status as labels
colnames(gistic.peaks)<-phenoN3$age.status 
heatmap(gistic.peaks, col = topo.colors(50), margin=c(10,6))

#The heatmap suggests that a patient carries many recurrent gene lesions

#PCA
summary(pca.filt <- prcomp(t(x_cnv), scale=T )) 
## Importance of components:
##                          PC1   PC2   PC3   PC4    PC5    PC6    PC7    PC8
## Standard deviation     7.630 6.488 5.472 4.642 4.3106 3.2729 2.9623 2.1138
## Proportion of Variance 0.294 0.213 0.151 0.109 0.0939 0.0541 0.0443 0.0226
## Cumulative Proportion  0.294 0.507 0.658 0.767 0.8605 0.9146 0.9589 0.9815
##                           PC9     PC10
## Standard deviation     1.9156 2.56e-15
## Proportion of Variance 0.0185 0.00e+00
## Cumulative Proportion  1.0000 1.00e+00
autoplot(pca.filt, data=phenoN3, colour="patientID", shape="age.status")

#There does not appear to be segregation by age status
#Note that the first 2 principal components, PC1 (29.4%) and PC2 (21.3%),
#together account for about 50.7% of the variance
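
#The proportions reported by summary() can be recomputed from the component standard deviations. The sketch below uses simulated data
#(the matrix toy is hypothetical) with 10 samples, as in this analysis.

```r
set.seed(7)
# Toy data: 10 samples (rows) x 30 features (columns)
toy <- matrix(rnorm(10 * 30), nrow = 10)

pca <- prcomp(toy, scale. = TRUE)

# Variance explained by each component = squared sdev / total variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained)[1:2], 3)   # cumulative proportion for PC1 and PC2

# With 10 centered samples, at most 9 components carry variance
pca$sdev[10]
```

#This is also why PC10 in the output above has a standard deviation that is numerically zero: centering 10 samples leaves at most 9 independent directions.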


#GGBIO VISUALIZATION OF GISTIC COPY NUMBER VARIATION (CNV) RECURRENT REGIONS:

hg19sub
## GRanges object with 22 ranges and 0 metadata columns:
##        seqnames      ranges strand
##           <Rle>   <IRanges>  <Rle>
##    [1]        1 1-249250621      *
##    [2]        2 1-243199373      *
##    [3]        3 1-198022430      *
##    [4]        4 1-191154276      *
##    [5]        5 1-180915260      *
##    ...      ...         ...    ...
##   [18]       18  1-78077248      *
##   [19]       19  1-59128983      *
##   [20]       20  1-63025520      *
##   [21]       21  1-48129895      *
##   [22]       22  1-51304566      *
##   -------
##   seqinfo: 22 sequences from hg19 genome
autoplot(hg19sub, layout = "circle", fill = "gray70")

#Use the same data to create the ideogram, label and scale tracks; the circle is laid out
#in the order the tracks are added, from inside to outside
#p <- ggbio() + circle(hg19sub, geom = "ideo", fill = "gray70") +
#  circle(hg19sub, geom = "scale", size = 2) +
#  circle(hg19sub, geom = "text", aes(label = seqnames), 
#         vjust = 0, size = 3)
#p

# Then we add a "rectangle" track to show the somatic CNV recurrent regions, which will look like vertical segments.
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gr<-rowRanges(cnv_gistic)


p <- ggbio() + circle(cnv_gr, geom = "rect", color = "steelblue") +
  circle(hg19sub, geom = "ideo", fill = "gray70") +
  circle(hg19sub, geom = "scale", size = 2) +
  circle(hg19sub, geom = "text", aes(label = seqnames), 
         vjust = 0, size = 3)
p

#Because copy number variation analysis is not covered in the DESeq/DESeq2 or edgeR manuals, we do not use these tools for that purpose.
#The distribution of CNV data does not match the negative binomial distribution that DESeq expects.
#CNV data are measured as discrete intervals, so something like a Hidden Markov Model (HMM) is more commonly employed, although CNV can be measured on a continuous scale too.
#The "fundamental limitation" of trying to detect CNV from RNA-seq is that a copy number event does not necessarily alter gene expression levels.
#A gene could easily be duplicated, for example, but without the promoter sequence and/or transcription start site (TSS)
#it will not be expressed (or will be expressed only at negligible levels). edgeR and DESeq2 can be used for ChIP-seq, mostly for differential peak calling, which is different from CNV calling:
#there the data are counts and their distribution is in accordance with RNA-seq. CNV calling with a DE tool that assumes normally distributed data is not
#suited to finding CNVs, which works on discrete data. One needs to find the right tool and the right distribution for finding CNVs, and there are plenty of
#technologies to produce the data and tools to generate copy number profiles from those data. One important point is to account properly for allelic frequencies
#while scanning through the genome and then to use segmentation to find copy ratios. This cannot be done with DESeq2.
#Most DE tools assume that the biological variation has a continuous distribution (e.g. normal or gamma), but variation due to CNV would be discrete, at
#integer multiples of the haploid coverage depth.

#Other options: divide the genome into 10 kb bins; count the reads in every bin; normalize for sequencing depth and make sure the CNV values in every bin are on the same scale to allow a valid comparison.
#HMMcopy can be used to tackle GC bias; CNVkit uses normal samples to create a reference against which it compares each sample.
#CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both
#whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent.

#DIFFERENTIAL GISTIC CNV ANALYSIS

#As a preliminary step, we use a simple linear regression model to assess differences in GISTIC gene-based recurrent lesion copy number variation:
x_cnv_model<-x_cnv
colnames(x_cnv_model)<-cond2
 
x_cnv_model.t<-t(x_cnv_model)
x_cnv_model.t.df<-as.data.frame(x_cnv_model.t)
x_cnv_model.t.df$age.status<-as.factor(cond2)
#x_cnv_model.t.df
#Example of simple linear regression with single categorical variable factor age.status for first gene CNV:
summary(lm(x_cnv_model.t.df$DIRAS3 ~ x_cnv_model.t.df$age.status))
## 
## Call:
## lm(formula = x_cnv_model.t.df$DIRAS3 ~ x_cnv_model.t.df$age.status)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -0.8   -0.4    0.2    0.2    0.6 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                        -0.600      0.224   -2.68    0.028 *
## x_cnv_model.t.df$age.statusyoung    0.400      0.316    1.26    0.242  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5 on 8 degrees of freedom
## Multiple R-squared:  0.167,  Adjusted R-squared:  0.0625 
## F-statistic:  1.6 on 1 and 8 DF,  p-value: 0.242
#Evidently, with p-value=0.242, DIRAS3 gene GISTIC CNV does not appear to be related to age.status
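
#With a single two-level factor as predictor, the p-value of the slope in lm() is the same as that of a pooled-variance two-sample t-test.
#The sketch below shows this on toy values (the vectors cnv and age are hypothetical, not the DIRAS3 data) and how the p-value can be extracted programmatically.

```r
# Toy GISTIC states for one gene in 5 old and 5 young patients (hypothetical)
cnv <- c(1, 0, 1, 1, 0,   0, -1, 0, 0, -1)
age <- factor(rep(c("old", "young"), each = 5))

# Linear model with the age group as the only predictor
fit <- lm(cnv ~ age)
p_lm <- summary(fit)$coefficients["ageyoung", "Pr(>|t|)"]

# Equivalent pooled-variance two-sample t-test
p_tt <- t.test(cnv ~ age, var.equal = TRUE)$p.value

all.equal(p_lm, p_tt)
```

#When the same model is fitted to all 198 genes, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust(..., method = "BH").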
#Now run one regression per gene, for all genes
 
my_lms <- lapply(1:((ncol(x_cnv_model.t.df))-1), function(x) lm(x_cnv_model.t.df[,x] ~ x_cnv_model.t.df$age.status))
# Extract just coefficients
sapply(my_lms, coef)
##                                  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## (Intercept)                      -0.6 -0.4 -0.2  0.4 -0.2 -0.2  0.2  0.4  0.2
## x_cnv_model.t.df$age.statusyoung  0.4  0.4 -0.2  0.2  0.0  0.4  0.2  0.2  0.4
##                                  [,10] [,11]    [,12]     [,13] [,14] [,15]
## (Intercept)                        0.4  -0.4 2.00e-01  2.00e-01  -0.6  -0.2
## x_cnv_model.t.df$age.statusyoung  -0.2   0.8 1.05e-16 -2.11e-16   0.4  -0.2
##                                      [,16]     [,17] [,18]   [,19] [,20] [,21]
## (Intercept)                       2.00e-01  2.00e-01   0.4 6.0e-01   0.2   0.4
## x_cnv_model.t.df$age.statusyoung -2.11e-16 -2.11e-16  -0.4 1.4e-16   0.4  -0.4
##                                     [,22]     [,23] [,24] [,25] [,26]    [,27]
## (Intercept)                      1.23e-16  7.02e-17   0.0   0.4  -0.6 -4.0e-01
## x_cnv_model.t.df$age.statusyoung 2.00e-01 -4.00e-01  -0.2  -0.4   0.2  1.4e-16
##                                     [,28] [,29] [,30] [,31] [,32] [,33] [,34]
## (Intercept)                      -4.0e-01   0.0  -0.6  -0.2  -0.2  -0.4   0.4
## x_cnv_model.t.df$age.statusyoung  1.4e-16  -0.2   0.2   0.0  -0.4   0.6  -0.2
##                                  [,35] [,36] [,37] [,38] [,39] [,40] [,41]
## (Intercept)                       -0.2  -0.6  -0.4   0.4   0.6   0.0  -0.4
## x_cnv_model.t.df$age.statusyoung   0.0   0.4   0.4  -0.4  -0.2  -0.2   0.4
##                                  [,42] [,43]    [,44] [,45] [,46]    [,47]
## (Intercept)                       -0.6   0.2 -4.0e-01  -0.4   0.0 3.51e-17
## x_cnv_model.t.df$age.statusyoung   0.4   0.4 -1.4e-16  -0.2  -0.2 4.00e-01
##                                  [,48]   [,49] [,50] [,51] [,52]    [,53] [,54]
## (Intercept)                       -0.2 6.0e-01   0.8   0.6   0.2 1.23e-16   0.8
## x_cnv_model.t.df$age.statusyoung  -0.2 1.4e-16  -0.2  -0.4   0.4 2.00e-01  -0.2
##                                  [,55] [,56] [,57] [,58] [,59] [,60]    [,61]
## (Intercept)                        0.4  -0.4   0.8   0.6  -0.2   0.4 6.00e-01
## x_cnv_model.t.df$age.statusyoung  -0.2   0.4  -0.2  -0.2  -0.4  -0.4 7.02e-17
##                                  [,62] [,63]    [,64] [,65]    [,66] [,67]
## (Intercept)                       -0.2   0.8 -4.0e-01   0.4 -4.0e-01   0.8
## x_cnv_model.t.df$age.statusyoung   0.0  -0.2 -1.4e-16  -0.4  1.4e-16  -0.4
##                                  [,68] [,69] [,70]   [,71] [,72] [,73] [,74]
## (Intercept)                        0.2   0.4   0.4 6.0e-01  -0.4  -0.4  -0.4
## x_cnv_model.t.df$age.statusyoung   0.2   0.2   0.2 1.4e-16  -0.2   0.4   0.0
##                                  [,75] [,76]   [,77] [,78]     [,79]     [,80]
## (Intercept)                        0.6  -0.4 6.0e-01  -0.6  3.51e-17  2.00e-01
## x_cnv_model.t.df$age.statusyoung  -0.2   0.4 1.4e-16   0.4 -4.00e-01 -2.11e-16
##                                  [,81]   [,82] [,83] [,84] [,85] [,86]   [,87]
## (Intercept)                       -0.2 6.0e-01  -0.2   0.4   0.4   0.6 6.0e-01
## x_cnv_model.t.df$age.statusyoung   0.0 1.4e-16   0.0  -0.4  -0.2  -0.2 1.4e-16
##                                  [,88] [,89] [,90] [,91] [,92]    [,93]   [,94]
## (Intercept)                        0.2  -0.6   0.4  -0.6   0.2 -4.0e-01 6.0e-01
## x_cnv_model.t.df$age.statusyoung   0.2   0.4   0.2   0.4   0.2 -1.4e-16 1.4e-16
##                                  [,95] [,96] [,97] [,98] [,99] [,100] [,101]
## (Intercept)                       -0.2  -0.2  -0.6  -0.4  -0.6    0.2    0.6
## x_cnv_model.t.df$age.statusyoung   0.0   0.0   0.4   1.0   0.4    0.4   -0.4
##                                  [,102]  [,103] [,104] [,105] [,106] [,107]
## (Intercept)                        -0.4 6.0e-01   -0.4    0.8    0.8    0.0
## x_cnv_model.t.df$age.statusyoung   -0.2 1.4e-16    0.0   -0.4   -0.2   -0.2
##                                    [,108] [,109] [,110] [,111] [,112] [,113]
## (Intercept)                      -4.0e-01    0.4    0.4   -0.2    0.8   -0.6
## x_cnv_model.t.df$age.statusyoung -1.4e-16    0.2   -0.2    0.2   -0.4    0.4
##                                  [,114]  [,115] [,116] [,117] [,118] [,119]
## (Intercept)                         0.8 6.0e-01    0.2    0.8    0.2    0.8
## x_cnv_model.t.df$age.statusyoung   -0.2 1.4e-16    0.4   -0.2    0.2   -0.2
##                                  [,120]    [,121] [,122]    [,123] [,124]
## (Intercept)                         0.6  2.00e-01   -0.4 -2.00e-01    0.4
## x_cnv_model.t.df$age.statusyoung   -0.4 -2.11e-16    0.4  1.76e-17   -0.4
##                                  [,125] [,126] [,127]  [,128] [,129]  [,130]
## (Intercept)                         0.8    0.4   -0.6 6.0e-01   -0.4 6.0e-01
## x_cnv_model.t.df$age.statusyoung   -0.2   -0.2    0.4 1.4e-16   -0.2 1.4e-16
##                                  [,131] [,132] [,133]   [,134] [,135] [,136]
## (Intercept)                        -0.6    0.4   -0.4 3.51e-17    0.2    0.8
## x_cnv_model.t.df$age.statusyoung    0.4    0.2    0.0 2.00e-01    0.2   -0.2
##                                  [,137] [,138] [,139] [,140] [,141]   [,142]
## (Intercept)                         0.4    0.4    0.8   -0.4   -0.6 4.00e-01
## x_cnv_model.t.df$age.statusyoung   -0.4    0.2   -0.4   -0.2    0.4 7.02e-17
##                                    [,143] [,144] [,145] [,146] [,147] [,148]
## (Intercept)                      2.00e-01    0.4    0.4   -0.2   -0.4    0.8
## x_cnv_model.t.df$age.statusyoung 1.05e-16    0.2    0.2    0.0   -0.2   -0.2
##                                  [,149]    [,150] [,151]  [,152] [,153]
## (Intercept)                         0.4  2.00e-01    0.0 6.0e-01    0.4
## x_cnv_model.t.df$age.statusyoung    0.2 -2.11e-16   -0.2 1.4e-16    0.2
##                                     [,154] [,155]    [,156] [,157] [,158]
## (Intercept)                      -1.05e-16   -0.2 -1.58e-16    0.6   -0.2
## x_cnv_model.t.df$age.statusyoung  4.00e-01    0.0  2.00e-01   -0.4   -0.2
##                                  [,159]  [,160] [,161] [,162] [,163] [,164]
## (Intercept)                        -0.2 6.0e-01    0.8    0.2    0.4    0.2
## x_cnv_model.t.df$age.statusyoung    0.2 1.4e-16   -0.2    0.2   -0.4    0.2
##                                  [,165] [,166] [,167] [,168] [,169] [,170]
## (Intercept)                         0.2    0.2   -0.4   -0.6   -0.6   -0.4
## x_cnv_model.t.df$age.statusyoung    0.4    0.4   -0.2    0.4    0.4    0.4
##                                  [,171] [,172]    [,173] [,174] [,175] [,176]
## (Intercept)                         0.4    0.6  7.02e-17   -0.4    0.4    0.4
## x_cnv_model.t.df$age.statusyoung    0.2   -0.4 -2.00e-01   -0.2    0.2   -0.4
##                                  [,177] [,178]    [,179] [,180]   [,181] [,182]
## (Intercept)                         0.4   -0.4 -8.78e-17    0.2 4.00e-01   -0.2
## x_cnv_model.t.df$age.statusyoung    0.2    0.2  2.00e-01    0.4 7.02e-17    0.0
##                                  [,183] [,184] [,185]   [,186]   [,187]
## (Intercept)                        -0.4    0.4   -0.2 2.00e-01 3.51e-17
## x_cnv_model.t.df$age.statusyoung    0.4   -0.2    0.2 1.05e-16 2.00e-01
##                                     [,188]  [,189] [,190]    [,191] [,192]
## (Intercept)                       2.00e-01 6.0e-01    0.4 -1.05e-16   -0.2
## x_cnv_model.t.df$age.statusyoung -2.11e-16 1.4e-16   -0.2  4.00e-01   -0.2
##                                  [,193] [,194]   [,195] [,196] [,197]   [,198]
## (Intercept)                         0.4    0.4 1.23e-16      0    0.8 1.23e-16
## x_cnv_model.t.df$age.statusyoung    0.2   -0.4 2.00e-01      0   -0.2 2.00e-01
#For more info, get full summary call:
summaries <- lapply(my_lms, summary)
#Coefficients (estimates) with their p-values:
p_values<-lapply(summaries, function(x) x$coefficients[, c(1,4)])
 
#The lowest p-value (0.0656) was obtained for list item 98:
gene_cnv<-colnames(x_cnv_model.t.df[98])
gene_cnv
## [1] "FOXO3"
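#The gene with the lowest p-value for a difference in GISTIC CNV value between young and old age.status is FOXO3.
#Note that this p-value is above 0.05 and is not adjusted for the 198 tests performed, so the result is suggestive at most.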
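#As a minimal sketch of how the lowest p-value could be located programmatically rather than by eye, the toy example below fits one
#lm per gene on invented data (cnv_toy, group_toy and the gene names are hypothetical, not the miniACC values), extracts the p-value
#of the age-group coefficient, and applies a Benjamini-Hochberg correction because one test is run per gene:

```r
# Toy data: 10 samples (5 old, 5 young) and 3 hypothetical genes with GISTIC-like integer calls
group_toy <- factor(rep(c("old", "young"), each = 5))
cnv_toy <- data.frame(
  geneA = c(0, 1, 0, -1, 0,    0, 1, 0, -1, 0),  # identical in both groups
  geneB = c(-1, -1, 0, -1, -1, 1, 1, 1, 0, 1),   # clearly shifted between groups
  geneC = c(1, 0, 1, 0, 1,     0, 1, 0, 1, 0)    # weak difference
)

# One linear model per gene (column)
fits_toy <- lapply(cnv_toy, function(y) lm(y ~ group_toy))

# p-value of the 'young' coefficient for each gene
pvals_toy <- sapply(fits_toy, function(f) summary(f)$coefficients["group_toyyoung", "Pr(>|t|)"])

# Benjamini-Hochberg adjustment for multiple testing
padj_toy <- p.adjust(pvals_toy, method = "BH")

# Name of the gene with the smallest raw p-value
best_toy <- names(which.min(pvals_toy))
best_toy
```

#With the real objects, the same pattern would presumably be applied to the list of summaries obtained from my_lms.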

#r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared, adj_r_sq = x$adj.r.squared))
##            [,1]      [,2]    [,3]    [,4]      [,5]  [,6]    [,7]    [,8]
## r_sq     0.1667  1.11e-01  0.0476  0.0222  1.64e-32  0.04  0.0244  0.0222
## adj_r_sq 0.0625 -2.22e-16 -0.0714 -0.1000 -1.25e-01 -0.08 -0.0976 -0.1000
##             [,9]   [,10] [,11]     [,12]     [,13]  [,14]   [,15]     [,16]
## r_sq      0.0909  0.0476 0.267  2.27e-32  1.83e-32 0.1667  0.0476  1.83e-32
## adj_r_sq -0.0227 -0.0714 0.175 -1.25e-01 -1.25e-01 0.0625 -0.0714 -1.25e-01
##              [,17] [,18]     [,19]   [,20] [,21]   [,22] [,23]   [,24] [,25]
## r_sq      1.83e-32 0.111  5.65e-32  0.0909 0.111  0.0345 0.111  0.0345 0.111
## adj_r_sq -1.25e-01 0.000 -1.25e-01 -0.0227 0.000 -0.0862 0.000 -0.0862 0.000
##          [,26]     [,27]     [,28]   [,29] [,30]     [,31]   [,32] [,33]
## r_sq      0.04  2.57e-32  2.57e-32  0.0345  0.04  1.64e-32  0.0625 0.310
## adj_r_sq -0.08 -1.25e-01 -1.25e-01 -0.0862 -0.08 -1.25e-01 -0.0547 0.224
##            [,34]     [,35]  [,36]     [,37] [,38]   [,39]   [,40] [,41]  [,42]
## r_sq      0.0476  1.64e-32 0.1667  1.11e-01 0.111  0.0154  0.0345 0.250 0.1667
## adj_r_sq -0.0714 -1.25e-01 0.0625 -2.22e-16 0.000 -0.1077 -0.0862 0.156 0.0625
##            [,43]     [,44] [,45]   [,46]   [,47]   [,48]     [,49]   [,50]
## r_sq      0.0909  2.05e-32  0.04  0.0345  0.0526  0.0476  7.70e-32  0.0476
## adj_r_sq -0.0227 -1.25e-01 -0.08 -0.0862 -0.0658 -0.0714 -1.25e-01 -0.0714
##           [,51]   [,52]   [,53]   [,54]   [,55] [,56]   [,57]   [,58]   [,59]
## r_sq     0.1667  0.0909  0.0345  0.0476  0.0476 0.250  0.0476  0.0154  0.0909
## adj_r_sq 0.0625 -0.0227 -0.0862 -0.0714 -0.0714 0.156 -0.0714 -0.1077 -0.0227
##          [,60]     [,61]     [,62]   [,63]     [,64] [,65]     [,66]  [,67]
## r_sq     0.250  7.70e-32  1.64e-32  0.0476  1.05e-31 0.111  1.93e-32 0.1667
## adj_r_sq 0.156 -1.25e-01 -1.25e-01 -0.0714 -1.25e-01 0.000 -1.25e-01 0.0625
##            [,68]   [,69]   [,70]     [,71] [,72]     [,73]  [,74]   [,75] [,76]
## r_sq      0.0244  0.0222  0.0222  7.70e-32  0.04  1.11e-01  0.000  0.0222 0.250
## adj_r_sq -0.0976 -0.1000 -0.1000 -1.25e-01 -0.08 -2.22e-16 -0.125 -0.1000 0.156
##              [,77]  [,78]   [,79]     [,80]     [,81]     [,82]     [,83] [,84]
## r_sq      7.70e-32 0.1667  0.0714  1.83e-32  1.64e-32  3.92e-32  1.64e-32 0.250
## adj_r_sq -1.25e-01 0.0625 -0.0446 -1.25e-01 -1.25e-01 -1.25e-01 -1.25e-01 0.156
##            [,85]   [,86]     [,87]   [,88]  [,89]   [,90]  [,91]   [,92]
## r_sq      0.0476  0.0154  7.70e-32  0.0244 0.1667  0.0222 0.1667  0.0244
## adj_r_sq -0.0714 -0.1077 -1.25e-01 -0.0976 0.0625 -0.1000 0.0625 -0.0976
##              [,93]     [,94]     [,95]     [,96]  [,97] [,98]  [,99]  [,100]
## r_sq      2.05e-32  3.92e-32  1.64e-32  1.64e-32 0.1667 0.362 0.1667  0.0909
## adj_r_sq -1.25e-01 -1.25e-01 -1.25e-01 -1.25e-01 0.0625 0.283 0.0625 -0.0227
##          [,101] [,102]    [,103] [,104] [,105]  [,106]  [,107]    [,108]
## r_sq     0.1667   0.04  5.65e-32  0.000 0.1667  0.0476  0.0345  2.05e-32
## adj_r_sq 0.0625  -0.08 -1.25e-01 -0.125 0.0625 -0.0714 -0.0862 -1.25e-01
##           [,109]  [,110]  [,111] [,112] [,113]  [,114]    [,115]  [,116]
## r_sq      0.0222  0.0476  0.0345 0.1667 0.1667  0.0476  5.65e-32  0.0909
## adj_r_sq -0.1000 -0.0714 -0.0862 0.0625 0.0625 -0.0714 -1.25e-01 -0.0227
##           [,117]  [,118]  [,119] [,120]    [,121] [,122]    [,123] [,124]
## r_sq      0.0476  0.0244  0.0476 0.1667  1.83e-32  0.250  7.22e-33  0.111
## adj_r_sq -0.0714 -0.0976 -0.0714 0.0625 -1.25e-01  0.156 -1.25e-01  0.000
##           [,125]  [,126] [,127]    [,128] [,129]    [,130] [,131]  [,132]
## r_sq      0.0476  0.0164 0.1667  5.65e-32   0.04  5.65e-32 0.1667  0.0222
## adj_r_sq -0.0714 -0.1066 0.0625 -1.25e-01  -0.08 -1.25e-01 0.0625 -0.1000
##          [,133]  [,134]  [,135]  [,136] [,137]  [,138] [,139] [,140] [,141]
## r_sq      0.000  0.0345  0.0244  0.0476  0.111  0.0222 0.1667   0.04 0.1667
## adj_r_sq -0.125 -0.0862 -0.0976 -0.0714  0.000 -0.1000 0.0625  -0.08 0.0625
##             [,142]    [,143]  [,144]  [,145]    [,146] [,147]  [,148]  [,149]
## r_sq      8.40e-33  2.27e-32  0.0222  0.0118  1.64e-32   0.04  0.0476  0.0222
## adj_r_sq -1.25e-01 -1.25e-01 -0.1000 -0.1118 -1.25e-01  -0.08 -0.0714 -0.1000
##             [,150]  [,151]    [,152]  [,153]  [,154]    [,155]  [,156] [,157]
## r_sq      1.83e-32  0.0345  5.65e-32  0.0222  0.0714  1.64e-32  0.0145 0.1667
## adj_r_sq -1.25e-01 -0.0862 -1.25e-01 -0.1000 -0.0446 -1.25e-01 -0.1087 0.0625
##           [,158]  [,159]    [,160]  [,161]  [,162] [,163]  [,164]  [,165]
## r_sq      0.0476  0.0345  5.65e-32  0.0476  0.0244  0.111  0.0123  0.0909
## adj_r_sq -0.0714 -0.0862 -1.25e-01 -0.0714 -0.0976  0.000 -0.1111 -0.0227
##           [,166] [,167] [,168] [,169]    [,170]  [,171] [,172]  [,173] [,174]
## r_sq      0.0909   0.04 0.1667 0.1667  1.11e-01  0.0222 0.1667  0.0204   0.04
## adj_r_sq -0.0227  -0.08 0.0625 0.0625 -2.22e-16 -0.1000 0.0625 -0.1020  -0.08
##           [,175] [,176]  [,177]  [,178]  [,179]  [,180]    [,181]    [,182]
## r_sq      0.0118  0.111  0.0118  0.0476  0.0204  0.0909  8.40e-33  1.64e-32
## adj_r_sq -0.1118  0.000 -0.1118 -0.0714 -0.1020 -0.0227 -1.25e-01 -1.25e-01
##          [,183]  [,184]  [,185]    [,186]  [,187]    [,188]    [,189]  [,190]
## r_sq      0.250  0.0476  0.0345  2.27e-32  0.0345  1.83e-32  5.65e-32  0.0476
## adj_r_sq  0.156 -0.0714 -0.0862 -1.25e-01 -0.0862 -1.25e-01 -1.25e-01 -0.0714
##           [,191]  [,192]  [,193] [,194]  [,195]    [,196]  [,197]  [,198]
## r_sq      0.0714  0.0123  0.0222  0.250  0.0345  8.65e-32  0.0476  0.0345
## adj_r_sq -0.0446 -0.1111 -0.1000  0.156 -0.0862 -1.25e-01 -0.0714 -0.0862
#The models are stored in a list, where model 3 is in my_lms[[3]] and so on.

#plot(x_cnv_model.t.df[,x], pch = 16, col = "blue") #Plot the results
#abline(lmTemp) #Add a regression line
#summary(lmTemp)
#plot(lmTemp$residuals, pch = 16, col = "red")


#We now explore reading and processing GISTIC files and data via three alternative approaches (maftools, readGISTIC, drug_prediction)
#before treating our GISTIC recurrent-lesion matrix from miniACC as an individual-call matrix to be used in the CNVRanger approach.
#TESTING MODIFIED DRUG PREDICTION FUNCTIONS TO PROCESS CNV GISTIC DATA
cnvs_drug<-as.data.frame(rowRanges(cnv_gistic))
cnv_df<-as.data.frame(assay(cnv_gistic))
#Make sure rownames() are samples, and colnames() are genes by transposing dataframe.
cnv_df.t<-t(cnv_df)
#Determine the number of samples we want the CNVs to be amplified in. The default is 10.
n=10
#Indicate whether or not we want to test cnv data. If TRUE, we will test cnv data. If FALSE, we will test mutation data.
cnv=TRUE
wd<-tempdir()
savedir<-setwd(wd)
#Apply map_cnv() function to produce the file map.RData, which stores the object 'theCnvQuantVecList_mat'
#map_cnv(Cnvs=cnvs_drug)
#Error in map_cnv(Cnvs = cnvs_drug) : 
#ERROR: Check colnames() of cnv data. colnames() must include Sample, Chromosome, Start, End, and Segment_Mean
#>   403 genes were dropped because they have exons located on both strands
#>   of the same reference sequence or on more than one reference sequence,
#>   so cannot be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a
#>   GRangesList object, or use suppressMessages() to suppress this message.

#load('map.RData') #This loads the object 'theCnvQuantVecList_mat', which was obtained using map_cnv()
#Make sure this data is a data frame and that colnames() are samples.
#data<-as.data.frame(t(theCnvQuantVecList_mat))
#samps<-colnames(data)
#colnames(data)<-substr(samps,1,nchar(samps)-12)
#Apply idwas() to test each CNV against each drug. The p-values and beta-values for each test will be exported.
#idwas(drug_prediction=cnv_df.t , data=data, n=n, cnv=cnv)
#THIS APPROACH YIELDED ERRORS DURING EXECUTION AND WAS ABANDONED 


#TESTING READING OF TCGA ACC GISTIC DATA DIRECTLY, USING SPECIALIZED FUNCTIONS THAT READ AND SUMMARIZE THE OUTPUT FILES GENERATED BY THE GISTIC PROGRAMME:

#The readGistic function can take the GISTIC output files provided manually, or a directory containing GISTIC results, and import all the relevant files:
#readGistic(gisticAllLesionsFile = NULL,gisticAmpGenesFile = NULL,gisticDelGenesFile = NULL,gisticScoresFile = NULL,cnLevel = "all",isTCGA = FALSE,verbose = TRUE)

#Arguments
#gisticAllLesionsFile = All Lesions file generated by GISTIC, e.g. all_lesions.conf_XX.txt, where XX is the confidence level. Required. Default NULL.
#gisticAmpGenesFile = Amplification Genes file generated by GISTIC, e.g. amp_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
#gisticDelGenesFile = Deletion Genes file generated by GISTIC, e.g. del_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
#gisticScoresFile = scores.gistic file generated by GISTIC.
#cnLevel = Level of CN changes to use. Can be 'all', 'deep' or 'shallow'. Default uses all, i.e. genes with either 'shallow' or 'deep' CN changes.
#isTCGA = Is the data from TCGA? Default FALSE.
#verbose = Default TRUE.

#Evidently, we REQUIRE the first of the four files generated by GISTIC, i.e. all_lesions.conf_XX.txt.
#Based on Sakar Khan's YouTube video "Copy Number Variation Analysis using GISTIC - Tutorial" (https://www.youtube.com/watch?v=Ssw7Ryao1x4&t=30s)
#and on the GenePattern documentation (https://www.genepattern.org/modules/docs/GISTIC_2.0#gsc.tab=0),
#the format of this first file includes the following columns:

#All Lesions File (all_lesions.conf_XX.txt, where XX is the confidence level)
#The all lesions file summarizes the results from the GISTIC run. It contains data about the significant regions of amplification and deletion as well as which samples are amplified or deleted in each of these regions. The identified regions are listed down the first column, and the samples are listed across the first row, starting in column 10.
#Region Data
#Columns 1-9 present the data about the significant regions as follows:
#Unique Name: A name assigned to identify the region.
#Descriptor: The genomic descriptor of that region
#Wide Peak Limits: The “wide peak” boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.
#Peak Limits: The boundaries of the region of maximal amplification or deletion.
#Region Limits: The boundaries of the entire significant region of amplification or deletion.
#q values: The q-value of the peak region.
#Residual q values after removing segments shared with higher peaks : The q-value of the peak region after removing (“peeling off”) amplifications or deletions that overlap other more significant peak regions in the same chromosome.
#Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called “broad”), focal events (called “focal”), or independently significant broad and focal events (called “both”).
#Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.
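
#The layout described above (nine region columns followed by one column per sample) can be illustrated with a small sketch.
#The table below is entirely invented (peak names, coordinates, q values and the samples S1-S3 are hypothetical); it only mimics
#the column order listed above and shows how the region annotation could be separated from the per-sample calls in base R:

```r
# Build a tiny tab-separated table that mimics the all_lesions column layout (all values invented)
header <- c("Unique Name", "Descriptor", "Wide Peak Limits", "Peak Limits", "Region Limits",
            "q values", "Residual q values", "Broad or Focal", "Amplitude Threshold",
            "S1", "S2", "S3")
row1 <- c("Amplification Peak 1", "1q22", "chr1:1-100", "chr1:10-90", "chr1:1-200",
          "0.001", "0.001", "focal", "key", "0", "1", "2")
row2 <- c("Deletion Peak 1", "9p21.3", "chr9:1-100", "chr9:10-90", "chr9:1-200",
          "0.01", "0.01", "broad", "key", "1", "0", "0")
toy_txt <- paste(sapply(list(header, row1, row2), paste, collapse = "\t"), collapse = "\n")

# Read it as a tab-delimited file; check.names = FALSE keeps the spaces in the column names
lesions_toy <- read.delim(text = toy_txt, check.names = FALSE, stringsAsFactors = FALSE)

# Columns 1-9 describe the regions; columns 10 onwards hold one call per sample
region_info <- lesions_toy[, 1:9]
calls_toy <- as.matrix(lesions_toy[, 10:ncol(lesions_toy)])
rownames(calls_toy) <- lesions_toy[["Unique Name"]]
calls_toy
```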

#all_data_by_genes.txt = Gene Symbol, Gene ID (Entrez), Cytoband, followed by one column per sample ID

#To obtain the aforementioned files in the appropriate format, we examine our previously generated RangedSummarizedExperiment:
cnv_gistic_calls<-as.data.frame(assay(cnv_gistic))
#This is not an appropriate format for the CNVRanger functions. We can either build a suitable data frame and GRangesList object from the
#GISTIC recurrent-lesion calls matrix of this assay, OR try to download the data in an appropriate format as follows:

query <- GDCquery(project = "TCGA-ACC",data.category = "Copy Number Variation",data.type = "Copy Number Segment",
                                        barcode = c("TCGA-OR-A5J9-01A-11D-A29H-01","TCGA-OR-A5JE-01A-11D-A29H-01","TCGA-OR-A5JF-01A-11D-A29H-01","TCGA-OR-A5JI-01A-11D-A29H-01",
                                                     "TCGA-OR-A5K0-01A-11D-A29H-01","TCGA-OR-A5KV-01A-11D-A29H-01","TCGA-OR-A5L5-01A-11D-A29H-01","TCGA-OR-A5LC-01A-11D-A29H-01","TCGA-OR-A5LE-01A-11D-A29H-01","TCGA-OR-A5LL-01A-11D-A29H-01"),
                                        sample.type = c("Primary Tumor"))
## --------------------------------------
## o GDCquery: Searching in GDC database
## --------------------------------------
## Genome of reference: hg38
## --------------------------------------------
## oo Accessing GDC. This might take a while...
## --------------------------------------------
## ooo Project: TCGA-ACC
## --------------------
## oo Filtering results
## --------------------
## ooo By data.type
## ooo By barcode
## ooo By sample.type
## ----------------
## oo Checking data
## ----------------
## ooo Checking if there are duplicated cases
## ooo Checking if there are results for the query
## -------------------
## o Preparing output
## -------------------
GDCdownload(query)
## Downloading data for project TCGA-ACC
## GDCdownload will download 10 files. A total of 338.142 KB
## Downloading as: Thu_Jul_11_03_27_05_2024.tar.gz
data <- GDCprepare(query)
## Reading copy number variation files
data
## # A tibble: 4,895 × 7
##    GDC_Aliquot           Chromosome  Start    End Num_Probes Segment_Mean Sample
##    <chr>                 <chr>       <dbl>  <dbl>      <dbl>        <dbl> <chr> 
##  1 726707fb-2431-4598-a… 1          6.29e4 1.88e6        304       -0.337 TCGA-…
##  2 726707fb-2431-4598-a… 1          1.88e6 4.67e6       1366       -1.14  TCGA-…
##  3 726707fb-2431-4598-a… 1          4.67e6 4.67e6          2       -7.19  TCGA-…
##  4 726707fb-2431-4598-a… 1          4.68e6 5.71e6        867       -1.06  TCGA-…
##  5 726707fb-2431-4598-a… 1          5.71e6 5.72e6         10       -2.47  TCGA-…
##  6 726707fb-2431-4598-a… 1          5.72e6 9.26e6       2014       -1.02  TCGA-…
##  7 726707fb-2431-4598-a… 1          9.27e6 1.69e7       4033       -0.215 TCGA-…
##  8 726707fb-2431-4598-a… 1          1.69e7 1.70e7         66       -0.691 TCGA-…
##  9 726707fb-2431-4598-a… 1          1.70e7 2.53e7       5229       -0.212 TCGA-…
## 10 726707fb-2431-4598-a… 1          2.53e7 2.53e7         24        0.399 TCGA-…
## # ℹ 4,885 more rows
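#A segment table like the one above stores log2 ratios (Segment_Mean), whereas CNV-call based tools typically expect an integer
#copy-number state per segment. As a hedged sketch, assuming a diploid baseline, the state can be approximated as
#round(2 * 2^Segment_Mean). The data frame seg_toy below is hypothetical (invented coordinates and sample names; the first three
#Segment_Mean values are the rounded ones printed above) and only base R is used:

```r
# Hypothetical segment table with the same column names as the GDC output
seg_toy <- data.frame(
  Sample       = c("S1", "S1", "S1", "S2"),
  Chromosome   = c("1", "1", "1", "2"),
  Start        = c(100, 5000, 9000, 200),
  End          = c(4000, 8000, 12000, 8000),
  Segment_Mean = c(-0.337, -1.14, 0.399, 0)
)

# Convert log2 ratio to an approximate integer copy number (2 = diploid)
seg_toy$state <- round(2 * 2^seg_toy$Segment_Mean)

# One data frame of calls per sample, the shape an individual-call list would take
calls_by_sample <- split(seg_toy, seg_toy$Sample)
seg_toy
```

#If GenomicRanges is available, GenomicRanges::makeGRangesListFromDataFrame(seg_toy, split.field = "Sample", keep.extra.columns = TRUE)
#would be one way to turn such a table into a GRangesList; that step is not run here.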
#Get the last run dates
lastRunDate <- getFirehoseRunningDates()[1]
lastAnalyseDate <- getFirehoseAnalyzeDates(1)
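#Download GISTIC results
#Note: getFirehoseData() only retrieves GISTIC2 data when GISTIC = TRUE is passed (the default is FALSE), and the GISTIC2 date is
#expected to be an analysis date (see getFirehoseAnalyzeDates()) rather than a run date. The call below does neither, which
#probably explains the empty AllByGene slot obtained further down.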
gistic <- getFirehoseData("ACC",gistic2_Date = getFirehoseRunningDates()[1]) #"20141017"
## RTCGAToolbox cache directory set to:
##     C:\Users\User\AppData\Local/R/cache/R/RTCGAToolbox
## Using locally cached version of C:\Users\User\AppData\Local/R/cache/R/RTCGAToolbox/20160128-ACC-Clinical.txt
# get GISTIC results
gistic.allbygene <- gistic@GISTIC@AllByGene
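#gistic.thresholedbygene <- gistic@GISTIC@ThresholedByGene
#Error: no slot of name "ThresholedByGene" for this object of class "FirehoseGISTIC"
#(The slot name is misspelled in the call above; the FirehoseGISTIC class names it ThresholdedByGene.)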
gistic.allbygene 
## data frame with 0 columns and 0 rows
#TO ULTIMATELY USE THE CNVRanger PACKAGE TO CONVERT INDIVIDUAL CALLS INTO GISTIC RECURRENT REGIONS LIKE THOSE WE OBTAINED VIA miniACC,
#WE NEED THE GISTIC FILES WITH DATA, BUT WE DID NOT SUCCEED IN OBTAINING THEM VIA THE TCGA TOOLS.

#ALTERNATIVELY, WE TRY TO OBTAIN THE NECESSARY FILES VIA THE maftools PACKAGE:
#With advances in cancer genomics, the Mutation Annotation Format (MAF) has become widely accepted for storing detected somatic variants.
#The Cancer Genome Atlas project has sequenced over 30 different cancers, with a sample size of over 200 for each cancer type.
#The resulting somatic variants are stored in Mutation Annotation Format (MAF).
#Note: the example GISTIC files bundled with maftools and read below are from the LAML cohort, not from ACC.
gistic_res_folder <- system.file("extdata", package = "maftools")
laml.gistic = readGistic(gisticDir = gistic_res_folder, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gistic <- system.file("extdata", "scores.gistic", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gistic, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
#gistic_maftools <- readGistic(gisticAllLesionsFile = "all_lesions.conf_99.txt", 
#                              gisticAmpGenesFile = "amp_genes.conf_99.txt", 
#                             gisticDelGenesFile = "del_genes.conf_99.txt", 
#                              cnLevel = "all", gisticScoresFile = "scores.gistic")
#Error: File 'all_lesions.conf_99.txt' does not exist or is non-readable. getwd()=='C:/Users/User/Documents'
 

#There are three types of plots available to visualize gistic results:
#genome plot
gisticChromPlot(gistic = laml.gistic, markBands = "all")

#Co-gisticChromPlot
#Similarly, two GISTIC objects can be plotted side-by-side for cohort comparison. In this example, the same GISTIC object is used for demonstration.
coGisticChromPlot(gistic1 = laml.gistic, gistic2 = laml.gistic, g1Name = "AML-1", g2Name = "AML-2", type = 'Amp')
#oncoplot
#This is similar to oncoplots, except that it is for copy number data. One can again sort the matrix according to annotations, if any. The plot below shows the GISTIC results for LAML, sorted according to FAB classification. It shows that 7q deletions are virtually absent in the M4 subtype, whereas they are widespread in other subtypes.
#gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)
#Error in h(simpleError(msg, call)) : 
#error in evaluating the argument 'x' in selecting a method for function 'getClinicalData': object 'laml' not found
#Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary, getGeneSummary and getCytoBandSummary. 
#Summarized results can be written to output files using function write.GisticSummary.

#BECAUSE WE DID NOT OBTAIN THE NECESSARY ALL-LESIONS FILE, AND BECAUSE WE WERE UNSUCCESSFUL IN PROCESSING THE ADDITIONAL INDIVIDUAL CNV CALLS (RAGGED EXPERIMENT)
#FROM TCGA, WE INSTEAD TREAT THE GISTIC EXPERIMENT PROVIDED VIA miniACC (GENE-BASED INTEGER STATES FOR RECURRENT CNV LESIONS)
#AS IF ITS REGIONS WERE THE "INDIVIDUAL CNV CALLS". WE CONVERT THEM INTO A GRangesList OBJECT, READ IT IN WITH CNVRanger,
#AND PROCESS IT AGAIN BY GISTIC2 TO YIELD THE STATISTICALLY SIGNIFICANT CHROMOSOME-WIDE RECURRENT REGIONS:

#CREATING AN INDIVIDUAL CALL-LIKE INPUT GRangesList OBJECT FOR CNVRanger USING OUR TCGA GISTIC SUMMARIZED EXPERIMENT:

gensInfo_CNV<-getBM(attributes=c("hgnc_symbol","ensembl_gene_id","entrezgene_id","chromosome_name","start_position","end_position","description" ), filters=c("hgnc_symbol"), values=list(rownames(assay(mACC.CN3))), mart=ensembl102)
gensInfo_CNV$length <- gensInfo_CNV$end_position - gensInfo_CNV$start_position
range(gensInfo_CNV$length)
## [1]    2403 1216444
table(duplicated(gensInfo_CNV$hgnc_symbol))  
## 
## FALSE  TRUE 
##   197    15
gensInfo_CNV[duplicated(gensInfo_CNV$hgnc_symbol),]
##     hgnc_symbol ensembl_gene_id entrezgene_id         chromosome_name
## 2         ACACA ENSG00000278540            31                      17
## 10         AKT3 ENSG00000117020         10000                       1
## 51         CHGA ENSG00000100604          1113                      14
## 53        CLDN7 ENSG00000181885          1366                      17
## 63        EEF2K ENSG00000103319         29904                      16
## 86       HSPA1A ENSG00000234475          3303 CHR_HSCHR6_MHC_DBB_CTG1
## 87       HSPA1A ENSG00000237724          3303 CHR_HSCHR6_MHC_COX_CTG1
## 88       HSPA1A ENSG00000215328          3303 CHR_HSCHR6_MHC_QBL_CTG1
## 89       HSPA1A ENSG00000204389          3303                       6
## 113        MAPT ENSG00000276155          4137      CHR_HSCHR17_1_CTG5
## 114        MAPT ENSG00000186868          4137                      17
## 122       MYH11 ENSG00000133392          4629                      16
## 153        PTEN ENSG00000171862          5728                      10
## 170     RPS6KA1 ENSG00000117676          6195                       1
## 211       YWHAE ENSG00000108953          7531                      17
##     start_position end_position
## 2         37084992     37406836
## 10       243488233    243851079
## 51        92923150     92935285
## 53         7259903      7263983
## 63        22206278     22288738
## 86        31797650     31800132
## 87        31802834     31805316
## 88        31805699     31808181
## 89        31815543     31817946
## 113       46069784     46203150
## 114       45894551     46028334
## 122       15703135     15857028
## 153       87863625     87971930
## 170       26529761     26575030
## 211        1344275      1400222
##                                                                                                            description
## 2                                                        acetyl-CoA carboxylase alpha [Source:HGNC Symbol;Acc:HGNC:84]
## 10                                                     AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
## 51                                                                   chromogranin A [Source:HGNC Symbol;Acc:HGNC:1929]
## 53                                                                        claudin 7 [Source:HGNC Symbol;Acc:HGNC:2049]
## 63                                           eukaryotic elongation factor 2 kinase [Source:HGNC Symbol;Acc:HGNC:24615]
## 86                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 87                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 88                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 89                                    heat shock protein family A (Hsp70) member 1A [Source:HGNC Symbol;Acc:HGNC:5232]
## 113                                              microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 114                                              microtubule associated protein tau [Source:HGNC Symbol;Acc:HGNC:6893]
## 122                                                           myosin heavy chain 11 [Source:HGNC Symbol;Acc:HGNC:7569]
## 153                                                  phosphatase and tensin homolog [Source:HGNC Symbol;Acc:HGNC:9588]
## 170                                                 ribosomal protein S6 kinase A1 [Source:HGNC Symbol;Acc:HGNC:10430]
## 211 tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein epsilon [Source:HGNC Symbol;Acc:HGNC:12851]
##     length
## 2   321844
## 10  362846
## 51   12135
## 53    4080
## 63   82460
## 86    2482
## 87    2482
## 88    2482
## 89    2403
## 113 133366
## 114 133783
## 122 153893
## 153 108305
## 170  45269
## 211  55947
length(setdiff(rownames(assay(mACC.CN3)), gensInfo_CNV$hgnc_symbol)) 
## [1] 1
countsFDF_CNV <- data.frame(ID=rownames(assay(mACC.CN3)),assay(mACC.CN3))
countsFInfo_CNV <- right_join(countsFDF_CNV, gensInfo_CNV, by=c("ID"="hgnc_symbol")) 
countsFInfo_CNV <- countsFInfo_CNV[!duplicated(countsFInfo_CNV$ID),] #After having checked duplications, just keep first result

countsFInfo_CNV_backup<-countsFInfo_CNV
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'chromosome_name'] <- 'chr'
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'start_position'] <- 'start'
colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'end_position'] <- 'end'
#colnames(countsFInfo_CNV_backup)[colnames(countsFInfo_CNV_backup) == 'chromosome_name'] <- 'state'

 
#REPLACING THE FOLLOWING INCORRECTLY FORMATTED CHROMOSOME NAMES OBTAINED VIA BIOMART WITH THE CORRECTLY FORMATTED CHROMOSOME 
#LOCATIONS FROM NCBI GENE DATABASE AND/OR UCSC GENOMIC BROWSER:


#14   RPS6KA1  CHR_HG2058_PATCH       26529761     26575030  = CHROMOSOME 1
#21   AKT3  CHR_HSCHR1_3_CTG32_1      243488233    243855434 = CHROMOSOME 1
#29   CLDN7      CHR_HG2087_PATCH        7259903      7263983 = CHROMOSOME 17
#36   PTEN  CHR_HG2334_PATCH       87863440     87966341 = CHROMOSOME 10
#69   YWHAE   CHR_HSCHR17_2_CTG2        1247054      1303157 =  CHROMOSOME 17
#85   MAPT      CHR_HSCHR17_2_CTG5       45906010     46039943 = CHROMOSOME 17
#102  ACACA CHR_HSCHR17_7_CTG4       37086456     37411442 = CHROMOSOME 17
#119  EEF2K    CHR_HG926_PATCH       21992621     22075070 = CHROMOSOME 16
#147  MYH11     CHR_HSCHR16_1_CTG1       15788326     15942169 = CHROMOSOME 16
#179  HSPA1A CHR_HSCHR6_MHC_APD_CTG1       31882493     31884975 = CHROMOSOME 6
#208  CHGA  CHR_HSCHR14_7_CTG1       92923080     92935293 = CHROMOSOME 14

#countsFInfo_CNV_backup %>% mutate(chr = ifelse(ID == "RPS6KA1", "1" , chr))
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "RPS6KA1", "chr"] <- "1"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "AKT3", "chr"] <- "1"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "CLDN7", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "PTEN", "chr"] <- "10"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "YWHAE", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "MAPT", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "ACACA", "chr"] <- "17"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "EEF2K", "chr"] <- "16"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "MYH11", "chr"] <- "16"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "HSPA1A", "chr"] <- "6"
countsFInfo_CNV_backup[countsFInfo_CNV_backup$ID == "CHGA", "chr"] <- "14"
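
#An alternative to relabelling chromosomes by hand is to prefer, for each duplicated symbol, the biomaRt row that lies on a primary
#chromosome, so that chromosome name and coordinates come from the same record. The sketch below is a hedged illustration on a
#small hypothetical data frame (genes_toy), filled with the PTEN and MAPT rows listed above; it is not the code that was run:

```r
# Two genes, each returned twice by biomaRt: once on a patch/alt contig, once on the primary chromosome
genes_toy <- data.frame(
  hgnc_symbol     = c("PTEN", "PTEN", "MAPT", "MAPT"),
  chromosome_name = c("CHR_HG2334_PATCH", "10", "CHR_HSCHR17_2_CTG5", "17"),
  start_position  = c(87863440, 87863625, 45906010, 45894551),
  end_position    = c(87966341, 87971930, 46039943, 46028334),
  stringsAsFactors = FALSE
)

# Names of the primary assembly chromosomes
primary_chr <- c(1:22, "X", "Y", "MT")
is_primary <- genes_toy$chromosome_name %in% primary_chr

# Sort so that, within each symbol, primary-chromosome rows come first, then keep the first row per symbol
ord <- order(genes_toy$hgnc_symbol, !is_primary)
genes_sorted <- genes_toy[ord, ]
genes_dedup <- genes_sorted[!duplicated(genes_sorted$hgnc_symbol), ]
genes_dedup
```

#Relabelling only the chromosome while keeping patch coordinates can leave start/end slightly shifted (compare the two PTEN rows),
#which matters if the ranges are later overlapped with other genomic features.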

#Extracting and subsetting the CNV matrix (chromosome name-corrected, no duplicated gene ID rows)
#from the summary dataframe countsFInfo_CNV_backup
rownames(countsFInfo_CNV_backup)<-countsFInfo_CNV_backup$ID
#PCA for CNV
countsFInfo_CNV_backup_PCAMFA<-countsFInfo_CNV_backup[,2:11]
 
#Transpose
countsFInfo_CNV_backup_PCAMFA.t<-t(countsFInfo_CNV_backup_PCAMFA)
# assign names, we include a cnv suffix to differentiate genes from micexp or exp
colnames(countsFInfo_CNV_backup_PCAMFA.t)<-paste(countsFInfo_CNV_backup$ID,"cnv",sep=".")
#Construct data.frame to perform PCA
cnv4pca<-data.frame(cond2,countsFInfo_CNV_backup_PCAMFA.t)
res.pca.cnv<-PCA(cnv4pca,quali.sup=1)
res.pca.cnv
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 10 individuals, described by 198 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"
plot(res.pca.cnv,habillage=1)
#countsFInfo_CNV_backup
#We observe differences between the young and old patient samples (in dim 1 and dim2)
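# The same kind of sample-level PCA can be sketched with base R alone. This is not the FactoMineR call used above:
# prcomp() is swapped in, and cnv_demo / group_demo are invented toy objects (6 samples, 4 CNV variables).
cnv_demo <- matrix(c(-1, 0,  1,  0,
                     -1, 0,  1,  1,
                     -2, 0,  0,  1,
                      1, 0, -1,  0,
                      2, 0, -1, -1,
                      1, 0,  0, -1),
                   nrow = 6, byrow = TRUE,
                   dimnames = list(paste0("S", 1:6), paste0("g", 1:4, ".cnv")))
group_demo <- factor(c("young", "young", "young", "old", "old", "old"))
# Constant columns carry no information and cannot be scaled to unit variance, so they are dropped first
keep <- apply(cnv_demo, 2, sd) > 0
pca_demo <- prcomp(cnv_demo[, keep], center = TRUE, scale. = TRUE)
# Proportion of variance explained by each component
var_expl <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
# Mean position of each group on the first component, a numeric counterpart to the coloured score plot
tapply(pca_demo$x[, 1], group_demo, mean)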

#FOR THE LATER CNVRanger-BASED CNV/mRNA-Seq EXPRESSION ANALYSIS, MATCH THE GENE IDs OF BOTH THE FILTERED, NORMALIZED, LOG-TRANSFORMED
#mRNA COUNTS AND THE GISTIC CNV RECURRENT LESION PEAKS:

countsF_extracted<-as.matrix(countsFInfo_backup[,2:11])
rownames(countsF_extracted)<-countsFInfo_backup$ID
#Setting equal the sampleIDs:
#colnames(normalized_df.log)<-colnames(assay(mACC.CN3))
colnames(countsF_extracted)<-colnames(countsFInfo_CNV_backup[,2:11])

phenoN2<-phenoN
rownames(phenoN2)<-colnames(countsF_extracted)
phenoN2$sample<-colnames(countsF_extracted)
cond2<-phenoN2$age.status 

#Checking NA
sum_na<-sum(is.na(countsF_extracted))
sum_na
## [1] 0
#Next, the mRNA-seq count matrix is normalized with DESeq2 size factors and then log2-transformed:
#DESeq2 on COUNT MATRIX:
#Converting to integer to avoid error
countsF_extracted_int<-countsF_extracted
object.size(countsF_extracted_int)
## 27496 bytes
mode(countsF_extracted_int) <- "integer"
object.size(countsF_extracted_int)
## 20256 bytes
cds <- DESeqDataSetFromMatrix(countData = countsF_extracted_int,colData = phenoN2,design = ~ age.status) 
dds <- estimateSizeFactors(cds)
normalized_df <- counts(dds, normalized=TRUE)
normalized_df_log <- log2(normalized_df+1)
#FILTERED, NO-NA, NORMALIZED, LOG-TRANSFORMED MATRIX READY TO BE PROCESSED
#######################################################
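# estimateSizeFactors() uses a median-of-ratios approach by default. The lines below are a minimal base R sketch
# of that idea, not the DESeq2 implementation itself; counts_demo is an invented toy matrix (4 genes, 3 samples).
counts_demo <- matrix(c(10, 20, 30,
                        20, 40, 60,
                         5, 10, 15,
                         0,  3,  1),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
# Per-gene log geometric mean across samples; genes containing a zero give -Inf and are skipped
log_geo_means <- rowMeans(log(counts_demo))
size_factors <- apply(counts_demo, 2, function(cnt) {
  usable <- is.finite(log_geo_means) & cnt > 0
  exp(median((log(cnt) - log_geo_means)[usable]))
})
# Divide each sample (column) by its size factor, then log2-transform with a pseudocount of 1
norm_demo <- sweep(counts_demo, 2, size_factors, "/")
log_norm_demo <- log2(norm_demo + 1)
size_factors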

#Subset to ensure same gene set is later co-analyzed
countsFInfo_CNV_backup_sub<-countsFInfo_CNV_backup[ rownames(countsFInfo_CNV_backup) %in% rownames(normalized_df_log), ]
 
#NOT TRANSPOSED df1_new<-as.data.frame(t(df1))
df1<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5J9.01A.11D.A29H.01" )]
df1$sampleID<-"TCGA.OR.A5J9.01A.11D.A29H.01"
colnames(df1)<-c("ID","chr", "start", "end", "state", "sampleID")
df1<-df1[,c("ID","chr", "start", "end", "sampleID","state")]

df2<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JE.01A.11D.A29H.01" )]
df2$sampleID<-"TCGA.OR.A5JE.01A.11D.A29H.01"
colnames(df2)<-c("ID","chr", "start", "end", "state", "sampleID")
df2<-df2[,c("ID","chr", "start", "end", "sampleID","state")]

df3<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JF.01A.11D.A29H.01" )]
df3$sampleID<-"TCGA.OR.A5JF.01A.11D.A29H.01"
colnames(df3)<-c("ID","chr", "start", "end", "state", "sampleID")
df3<-df3[,c("ID","chr", "start", "end", "sampleID","state")]

df4<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5JI.01A.11D.A29H.01" )]
df4$sampleID<-"TCGA.OR.A5JI.01A.11D.A29H.01"
colnames(df4)<-c("ID","chr", "start", "end", "state", "sampleID")
df4<-df4[,c("ID","chr", "start", "end", "sampleID","state")]

df5<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5K0.01A.11D.A29H.01" )]
df5$sampleID<-"TCGA.OR.A5K0.01A.11D.A29H.01"
colnames(df5)<-c("ID","chr", "start", "end", "state", "sampleID")
df5<-df5[,c("ID","chr", "start", "end", "sampleID","state")]

df6<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5KV.01A.11D.A29H.01" )]
df6$sampleID<-"TCGA.OR.A5KV.01A.11D.A29H.01"
colnames(df6)<-c("ID","chr", "start", "end", "state", "sampleID")
df6<-df6[,c("ID","chr", "start", "end", "sampleID","state")]

df7<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5L5.01A.11D.A29H.01" )]
df7$sampleID<-"TCGA.OR.A5L5.01A.11D.A29H.01"
colnames(df7)<-c("ID","chr", "start", "end", "state", "sampleID")
df7<-df7[,c("ID","chr", "start", "end", "sampleID","state")]

df8<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LC.01A.11D.A29H.01" )]
df8$sampleID<-"TCGA.OR.A5LC.01A.11D.A29H.01"
colnames(df8)<-c("ID","chr", "start", "end", "state", "sampleID")
df8<-df8[,c("ID","chr", "start", "end", "sampleID","state")]

df9<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LE.01A.11D.A29H.01" )]
df9$sampleID<-"TCGA.OR.A5LE.01A.11D.A29H.01"
colnames(df9)<-c("ID","chr", "start", "end", "state", "sampleID")
df9<-df9[,c("ID","chr", "start", "end", "sampleID","state")]

df10<-countsFInfo_CNV_backup_sub[, c("ID","chr", "start", "end", "TCGA.OR.A5LL.01A.11D.A29H.01" )]
df10$sampleID<-"TCGA.OR.A5LL.01A.11D.A29H.01"
colnames(df10)<-c("ID","chr", "start", "end", "state", "sampleID")
df10<-df10[,c("ID","chr", "start", "end", "sampleID","state")]

CNV_calls<-rbind(df1, df2, df3,df4, df5, df6, df7, df8, df9, df10)

CNV_calls_sort<-CNV_calls[order(CNV_calls$ID,decreasing = FALSE), ]
#Adding 2 to the GISTIC values (-2 to 2) converts them to the integer copy-number states used by CNVRanger (0 to 4, with 2 = diploid):

#CNV_calls_sort_add<-apply(CNV_calls_sort,1,function(x) x["state"]+2)
CNV_calls_sort$state<-CNV_calls_sort$state+2
nrow(CNV_calls_sort)
## [1] 1800
rownames(CNV_calls_sort)<-c(1:nrow(CNV_calls_sort))
CNV_calls_sort_sort2<-CNV_calls_sort[order(CNV_calls_sort$chr,CNV_calls_sort$start), ] 
rownames(CNV_calls_sort_sort2)<-c(1:nrow(CNV_calls_sort_sort2))
CNV_calls_sort_sort2<-CNV_calls_sort_sort2[, c("chr","start", "end","sampleID", "state","ID")]
CNV_calls_sort_sort2$chr<-paste0("chr",CNV_calls_sort_sort2$chr )
length(unique(CNV_calls_sort_sort2[,"sampleID"]))
## [1] 10
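# The ten near-identical df1 ... df10 blocks above reshape a wide table (one column per sample) into long format
# (one row per gene and sample). A loop-based sketch of the same reshaping follows; wide_demo, its coordinates and
# the sample names S1/S2 are invented for illustration and do not come from the data.
wide_demo <- data.frame(ID = c("PTEN", "AKT3"),
                        chr = c("10", "1"),
                        start = c(100, 200),
                        end = c(150, 260),
                        S1 = c(-1, 0),
                        S2 = c(1, 2),
                        stringsAsFactors = FALSE)
sample_cols <- c("S1", "S2")
# One long-format block per sample, stacked with rbind
calls_demo <- do.call(rbind, lapply(sample_cols, function(s) {
  data.frame(wide_demo[, c("ID", "chr", "start", "end")],
             sampleID = s,
             state = wide_demo[[s]],
             stringsAsFactors = FALSE)
}))
# Shift GISTIC-style values (-2..2) to copy-number states (0..4)
calls_demo$state <- calls_demo$state + 2
calls_demo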
#We build a genomic ranges object for the genes, carrying both the Ensembl gene IDs and the HGNC gene symbols:
df_sel_gene<-countsFInfo_CNV_backup_sub[, c("chr","start","end", "ensembl_gene_id", "ID")]
df_sel_gene$strand="*" 
df_sel_gene$score=1
df_sel_gene$chr<-paste0("chr", df_sel_gene$chr )
df_sel_gene<-df_sel_gene[, c("chr","start","end", "strand", "score", "ensembl_gene_id", "ID")]
 
gr_sel_gene<-makeGRangesFromDataFrame(df_sel_gene,keep.extra.columns=TRUE)
gr_sel_gene_hgnc<-gr_sel_gene
gr_sel_gene_ensembl<-gr_sel_gene
#split.field = "ensembl_gene_id"
#names.field = "ensembl_gene_id"
#ignore.strand=TRUE
#names(gr_sel_gene)<-mcols(gr_sel_gene)$ensembl_gene_id 
#names(gr_sel_gene_hgnc)<-mcols(gr_sel_gene_hgnc)$ID 
 
#Once read into an R data.frame, we group the calls by sample ID and convert them to a GRangesList. 
#Each element of the list corresponds to a sample, and contains the genomic coordinates of the CNV calls for this sample 
#(along with the copy number state in the state metadata column)
grl <- GenomicRanges::makeGRangesListFromDataFrame(CNV_calls_sort_sort2, split.field="sampleID", keep.extra.columns=TRUE)
grl <- GenomicRanges::sort(grl)
grl
## GRangesList object of length 10:
## $TCGA.OR.A5J9.01A.11D.A29H.01
## GRanges object with 180 ranges and 2 metadata columns:
##         seqnames              ranges strand |     state          ID
##            <Rle>           <IRanges>  <Rle> | <numeric> <character>
##     [1]     chr1     7954291-7985505      * |         1       PARK7
##     [2]     chr1     8004404-8026309      * |         1      ERRFI1
##     [3]     chr1   11106535-11262551      * |         2        MTOR
##     [4]     chr1   15490832-15526534      * |         2       CASP9
##     [5]     chr1   25884181-25906991      * |         2       STMN1
##     ...      ...                 ...    ... .       ...         ...
##   [176]     chrX   47561100-47571920      * |         3        ARAF
##   [177]     chrX   48574449-48581162      * |         3        RBM3
##   [178]     chrX   49187815-49200199      * |         3         SYP
##   [179]     chrX 123859724-123913976      * |         3        XIAP
##   [180]     chrX 154531391-154547572      * |         3        G6PD
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
## 
## ...
## <9 more elements>
#Specifically developed for CNV calls inferred from SNP-chip data, CNVRanger allows carrying out a probe-level genome-wide association study (GWAS) 
#with quantitative phenotypes. As previously described (da Silva et al., 2016), we construct CNV segments from probes representing common CN polymorphisms (allele frequency >1%), and carry out a GWAS as implemented in PLINK using a standard linear regression of phenotype on allele dosage.
#For CNV segments composed of multiple probes, the segment p-value is chosen from the probe p-values, using either the probe with minimum p-value or the probe with maximum CNV frequency.
#For compatibility with PLINK's fam file format, we create another phenotype data frame containing four columns representing patient traits from our MultiAssayExperiment
phenoN4 <- data.frame(sample.id=colnames(assay(mACC.CN3)),fam=colData(miniACC.assays.comp.age)$race,sex=colData(miniACC.assays.comp.age)$gender, age.status=colData(miniACC.assays.comp.age)$years_to_birth)
 
#We combine the GISTIC CNV recurrent lesions peak "calls" with the phenotype information in a RaggedExperiment for coordinated representation and analysis:
re_gwas <- RaggedExperiment::RaggedExperiment(grl, colData=phenoN4)
re_gwas 
## class: RaggedExperiment 
## dim: 1800 10 
## assays(2): state ID
## rownames: NULL
## colnames(10): TCGA.OR.A5J9.01A.11D.A29H.01 TCGA.OR.A5JE.01A.11D.A29H.01
##   ... TCGA.OR.A5LE.01A.11D.A29H.01 TCGA.OR.A5LL.01A.11D.A29H.01
## colData names(4): sample.id fam sex age.status
#Given a RaggedExperiment storing CNV calls together with phenotype information, and optionally a map file for probe-level coordinates, 
#the setupCnvGWAS function sets up all files needed for the GWAS analysis. The information required for analysis is stored in the resulting phen.info list:
#phen.info <- setupCnvGWAS("example", cnv.out.loc=re_gwas)
#phen.info
#Error in cnv.p.df[, 3] : subscript out of bounds
#In addition: There were 50 or more warnings (use warnings() to see the first 50)
#warnings()
#1: In .merge_two_Seqinfo_objects(x, y) :
#The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.)
#The last item of the list displays the working directory:
#all.paths <- phen.info$all.paths
#all.paths
#For the GWAS, chromosome names are assumed to be integer (i.e. 1, 2, 3, ...).
#We can then run the actual CNV-GWAS, here without correction for multiple testing which is done for demonstration only. 
#In real analyses, multiple testing correction is recommended to avoid inflation of false positive findings.
#segs.pvalue.gr <- cnvGWAS(phen.info, chr.code.name=chr.code.name, method.m.test="none")
#segs.pvalue.gr
#BECAUSE setupCnvGWAS() FAILED WITH THE ERROR SHOWN ABOVE, THE cnvGWAS() STEP IS NOT EXECUTED

#In CNV analysis, it is often of interest to summarize individual calls across the population (i.e. to define CNV regions) 
#for subsequent association analysis with expression and phenotype data. In the simplest case, this just merges overlapping individual 
#calls into summarized regions. Here we set est.recur=TRUE to deploy a GISTIC-like significance estimation of recurrence.
cnvrs <- populationRanges(grl, density=0.1, est.recur=TRUE)
## Excluding 976 copy-number neutral regions (CN state = 2, diploid)
cnvrs 
## GRanges object with 180 ranges and 3 metadata columns:
##         seqnames              ranges strand |      freq        type    pvalue
##            <Rle>           <IRanges>  <Rle> | <numeric> <character> <numeric>
##     [1]     chr1     7954291-7985505      * |         5        loss       0.0
##     [2]     chr1     8004404-8026309      * |         5        loss       0.0
##     [3]     chr1   11106535-11262551      * |         4        loss       0.1
##     [4]     chr1   15490832-15526534      * |         3        loss       0.3
##     [5]     chr1   25884181-25906991      * |         4        loss       0.1
##     ...      ...                 ...    ... .       ...         ...       ...
##   [176]     chrX   47561100-47571920      * |         8        both  0.000000
##   [177]     chrX   48574449-48581162      * |         8        both  0.000000
##   [178]     chrX   49187815-49200199      * |         8        both  0.000000
##   [179]     chrX 123859724-123913976      * |         7        both  0.452381
##   [180]     chrX 154531391-154547572      * |         7        both  0.452381
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
#plotRecurrentRegions(regs, genome, chr, pthresh = 0.05)
#We filter for recurrent CNVs with a p-value below the significance threshold of 0.05.
subset(cnvrs, pvalue < 0.05)
## GRanges object with 26 ranges and 3 metadata columns:
##        seqnames              ranges strand |      freq        type    pvalue
##           <Rle>           <IRanges>  <Rle> | <numeric> <character> <numeric>
##    [1]     chr1     7954291-7985505      * |         5        loss         0
##    [2]     chr1     8004404-8026309      * |         5        loss         0
##    [3]     chr1 110338506-110346681      * |         5        loss         0
##    [4]     chr1 114704469-114716771      * |         5        loss         0
##    [5]     chr5   52989340-53094779      * |         7        gain         0
##    ...      ...                 ...    ... .       ...         ...       ...
##   [22]    chr22   29603556-29698598      * |         5        loss         0
##   [23]    chr22   36281280-36387967      * |         5        loss         0
##   [24]     chrX   47561100-47571920      * |         8        both         0
##   [25]     chrX   48574449-48581162      * |         8        both         0
##   [26]     chrX   49187815-49200199      * |         8        both         0
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

#We illustrate the landscape of recurrent CNV regions using the function plotRecurrentRegions.
#We therefore provide the summarized CNV regions, a valid UCSC genome assembly, and a chromosome of interest.

plotRecurrentRegions(cnvrs, genome="hg19", chr="chr1")
plotRecurrentRegions(cnvrs, genome="hg19", chr="chr22")
plotRecurrentRegions(cnvrs, genome="hg19", chr="chr5")
plotRecurrentRegions(cnvrs, genome="hg19", chr="chrX")

sel.genes <- subset(gr_sel_gene, seqnames %in% paste0("chr", 1:2))
sel.genes_hgnc <- subset(gr_sel_gene_hgnc, seqnames %in% paste0("chr", 1:2))
sel.cnvrs <- subset(cnvrs, seqnames %in% paste0("chr", 1:2))

#The findOverlaps function from the GenomicRanges package is a general function for finding overlaps between two sets of genomic regions. 
#Here, we use the function to find protein-coding genes overlapping the summarized CNV regions.
#Resulting overlaps are represented as a Hits object, from which overlapping query and subject regions can be obtained with dedicated accessor 
#functions (named queryHits and subjectHits, respectively). Here, we use these functions to also annotate the CNV type (gain/loss) for genes overlapping with CNVs.

olaps <- GenomicRanges::findOverlaps(sel.genes, sel.cnvrs, ignore.strand=TRUE)
qh <- S4Vectors::queryHits(olaps)
sh <- S4Vectors::subjectHits(olaps)
cgenes <- sel.genes[qh]
cgenes$type <- sel.cnvrs$type[sh]
subset(cgenes, select = "type")
## GRanges object with 30 ranges and 1 metadata column:
##           seqnames              ranges strand |        type
##              <Rle>           <IRanges>  <Rle> | <character>
##    DIRAS3     chr1   68045886-68051631      * |        loss
##    IGFBP2     chr2 216632828-216664436      * |        gain
##   RPS6KA1     chr1   26529761-26575030      * |        loss
##       FN1     chr2 215360440-215436073      * |        gain
##   BCL2L11     chr2 111119378-111168445      * |        gain
##       ...      ...                 ...    ... .         ...
##    ERRFI1     chr1     8004404-8026309      * |        loss
##     PARP1     chr1 226360210-226408154      * |        both
##     CASP9     chr1   15490832-15526534      * |        loss
##      MSH2     chr2   47403067-47663146      * |        both
##     CASP8     chr2 201233443-201287711      * |        gain
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
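# What findOverlaps() does conceptually can be sketched in base R for a single chromosome: two closed intervals
# overlap when each one starts no later than the other ends. genes_demo and cnvr_demo are invented toy tables,
# and this sketch ignores chromosomes and strand, which the GenomicRanges function handles.
genes_demo <- data.frame(ID = c("g1", "g2", "g3"),
                         start = c(100, 500, 900),
                         end = c(200, 600, 950),
                         stringsAsFactors = FALSE)
cnvr_demo <- data.frame(start = c(150, 700),
                        end = c(520, 800),
                        type = c("loss", "gain"),
                        stringsAsFactors = FALSE)
# Logical matrix: rows = genes (query), columns = CNV regions (subject)
ov <- outer(seq_len(nrow(genes_demo)), seq_len(nrow(cnvr_demo)),
            function(i, j) genes_demo$start[i] <= cnvr_demo$end[j] &
                           cnvr_demo$start[j] <= genes_demo$end[i])
hits <- which(ov, arr.ind = TRUE)
# Annotate each overlapping gene with the type of the CNV region it overlaps
annotated_demo <- data.frame(ID = genes_demo$ID[hits[, "row"]],
                             type = cnvr_demo$type[hits[, "col"]],
                             stringsAsFactors = FALSE)
annotated_demo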

olaps_hgnc <- GenomicRanges::findOverlaps(sel.genes_hgnc, sel.cnvrs, ignore.strand=TRUE)
qh_hgnc <- S4Vectors::queryHits(olaps_hgnc)
sh_hgnc <- S4Vectors::subjectHits(olaps_hgnc)
cgenes_hgnc <- sel.genes_hgnc[qh_hgnc]
cgenes_hgnc$type <- sel.cnvrs$type[sh_hgnc]
subset(cgenes_hgnc, select = "type")
## GRanges object with 30 ranges and 1 metadata column:
##           seqnames              ranges strand |        type
##              <Rle>           <IRanges>  <Rle> | <character>
##    DIRAS3     chr1   68045886-68051631      * |        loss
##    IGFBP2     chr2 216632828-216664436      * |        gain
##   RPS6KA1     chr1   26529761-26575030      * |        loss
##       FN1     chr2 215360440-215436073      * |        gain
##   BCL2L11     chr2 111119378-111168445      * |        gain
##       ...      ...                 ...    ... .         ...
##    ERRFI1     chr1     8004404-8026309      * |        loss
##     PARP1     chr1 226360210-226408154      * |        both
##     CASP9     chr1   15490832-15526534      * |        loss
##      MSH2     chr2   47403067-47663146      * |        both
##     CASP8     chr2 201233443-201287711      * |        gain
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

#We illustrate the original CNV calls on overlapping genomic features (here: protein-coding genes).
#For this purpose, an oncoPrint plot provides a useful summary in a rectangular fashion (genes in the rows, samples in the columns).
#Stacked barplots on the top and the right of the plot display the number of altered genes per sample and the number of altered samples per gene, respectively.

cnvOncoPrint(grl, cgenes)
cnvOncoPrint(grl, cgenes_hgnc)

#Overlap permutation test
#As a certain amount of overlap can be expected just by chance, an assessment of statistical significance is needed to decide whether the observed overlap 
#is greater (enrichment) or less (depletion) than expected by chance. The regioneR package implements a general framework for testing overlaps of genomic regions
#based on permutation sampling. This allows repeated sampling of random regions from the genome, matching size and chromosomal distribution of the region set under 
#study (here: the CNV regions). By recomputing the overlap with the functional features in each permutation, statistical significance of the observed overlap 
#can be assessed. We demonstrate in the following how this strategy can be used to assess the overlap between the detected CNV regions and protein-coding regions 
#in the human genome. We expect to find a depletion as protein-coding regions are highly conserved and rarely subject to long-range structural variation such as CNV.
#Hence, is the overlap between CNVs and protein-coding genes less than expected by chance? To answer this question, we apply an overlap permutation test 
#with 100 permutations (ntimes=100), while maintaining chromosomal distribution of the CNV region set (per.chromosome=TRUE). 
#Furthermore, we use the option count.once=TRUE to count an overlapping CNV region only once, even if it overlaps with 2 or more genes. 
#We also allow random regions to be sampled from the entire genome (mask=NA), although in certain scenarios masking certain regions such 
#as telomeres and centromeres is advisable. Also note that we use 100 permutations for demonstration only. 
#To draw robust conclusions, a minimum of 1000 permutations should be carried out.

#The masked package BSgenome.Hsapiens.UCSC.hg38.masked provides the same sequences as BSgenome.Hsapiens.UCSC.hg38, except that each of them has the following 4 masks on top: 
#(1) the mask of assembly gaps (AGAPS mask), (2) the mask of intra-contig ambiguities (AMB mask), 
#(3) the mask of repeats from RepeatMasker (RM mask), and (4) the mask of repeats from Tandem Repeats Finder (TRF mask). 
#Only the AGAPS and AMB masks are "active" by default. The sequences are stored in MaskedDNAString objects.
res <- regioneR::overlapPermTest(A=sel.cnvrs, B=sel.genes, ntimes=100, genome="hg38", mask=NA, per.chromosome=TRUE, count.once=TRUE)
res
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 56.8712
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 30
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
summary(res[[1]]$permuted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.3     1.0     2.0
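#The resulting permutation p-value, with alternative "greater" and a large positive z-score, indicates a significant enrichment rather than a depletion:
#30 CNV regions on chromosomes 1 and 2 (sel.cnvrs object) overlap at least one gene, versus a mean of 0.3 in the permuted region sets.
#This is expected here, because the CNV calls were defined on the gene coordinates themselves.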
plot(res)
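# The printed p-value 0.0099 equals 1/101, which is what an empirical permutation p-value of the form
# (b + 1) / (n + 1) gives when none of the n = 100 permuted overlaps reaches the observed one. The helper below is
# a hypothetical base R sketch of that formula, not regioneR code; perm_demo is an invented vector that only
# roughly mimics the summary shown above.
perm_pvalue <- function(observed, permuted, alternative = c("greater", "less")) {
  alternative <- match.arg(alternative)
  # b = number of permutations at least as extreme as the observed value
  b <- if (alternative == "greater") sum(permuted >= observed) else sum(permuted <= observed)
  (b + 1) / (length(permuted) + 1)
}
perm_demo <- c(rep(0, 75), rep(1, 20), rep(2, 5))
p_greater <- perm_pvalue(30, perm_demo, "greater")
p_less <- perm_pvalue(30, perm_demo, "less")
# z-score of the observed overlap relative to the permutation distribution
z_demo <- (30 - mean(perm_demo)) / sd(perm_demo)
c(p_greater = p_greater, p_less = p_less)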

#RE-attempting with entire gene set(not just chromosomes 1 and 2):
res2 <- regioneR::overlapPermTest(A=cnvrs, B=gr_sel_gene, ntimes=100, genome="hg38", mask=NA, per.chromosome=TRUE, count.once=TRUE)
res2
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 104.4132
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 180
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
summary(res2[[1]]$permuted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    2.00    2.42    3.00    8.00
plot(res2)

#A more pronounced peak became apparent

mRNA-Seq AND GISTIC CNV DATA BLOCK CORRELATION ANALYSIS

#Studies of expression quantitative trait loci (eQTLs) aim at the discovery of genetic variants that explain variation in gene expression levels 
#(Nica and Dermitzakis, 2013). Mainly applied in the context of SNPs, the concept also naturally extends to the analysis of CNVs.
#The CNVRanger package implements association testing between CNV regions and RNA-seq read counts using edgeR, 
#which applies generalized linear models based on the negative-binomial distribution while incorporating normalization factors for different library sizes.
#In the case of only one CN state deviating from 2n for a CNV region under investigation, this reduces to the classical 2-group comparison. 
#For more than two states (e.g. 0n, 1n, 2n), edgeR's ANOVA-like test is applied to test all deviating groups 
#for significant expression differences relative to 2n.
#Assuming distinct modes of action, effects observed in the CNV-expression analysis are typically divided into (i) local effects (cis), 
#where expression changes coincide with CNVs in the respective genes, and (ii) distal effects (trans), where CNVs supposedly affect trans-acting regulators 
#such as transcription factors. However, due to power considerations and to avoid detection of spurious effects, stringent filtering of 
#(i) not sufficiently expressed genes, and (ii) CNV regions with insufficient sample size in groups deviating from 2n, should be carried out 
#when testing for distal effects. Local effects have a clear spatial indication and the number of genes located in or close to a CNV region of 
#interest is typically small; testing for differential expression between CN states is thus generally better powered for local effects 
#and less stringent filter criteria can be applied. In the following, we carry out CNV-expression association analysis by providing the 
#CNV regions to test (cnvrs), the individual CNV calls (grl) to determine per-sample CN state in each CNV region, the RNA-seq read counts (rse), 
#and the size of the genomic window around each CNV region (window). The window argument thereby determines which genes are considered for testing 
#for each CNV region and is set here to 1 Mbp. Further, we use the filter.by.expr and min.samples arguments to exclude from the analysis 
#(i) genes with very low read count across samples, and (ii) CNV regions with fewer than min.samples samples in a group deviating from 2n.

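#Note: cnvEQTL() relies on edgeR and is documented for raw read counts; the normalized, log2-transformed values are passed here,
#so the resulting statistics should be interpreted with caution
rcounts<-normalized_df_log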
rcounts<-rcounts[rownames(rcounts) %in% rownames(df_sel_gene),]
 
#traceback()
#RENAME SAMPLEID NAMES FOR ALL OBJECTS:
test<-gr_sel_gene_hgnc
#names(gr_sel_gene_ensembl)<-mcols(gr_sel_gene_ensembl)$ensembl_gene_id
#names(gr_sel_gene_hgnc)<-mcols(gr_sel_gene_hgnc)$ID

rranges <- GenomicRanges::granges(test)[rownames(rcounts)]
rse <- SummarizedExperiment(assays=list(rcounts=rcounts), rowRanges=rranges)
rse
## class: RangedSummarizedExperiment 
## dim: 180 10 
## metadata(0):
## assays(1): rcounts
## rownames(180): DIRAS3 MAPK14 ... IDH3A SQSTM1
## rowData names(0):
## colnames(10): TCGA.OR.A5J9.01A.11D.A29H.01 TCGA.OR.A5JE.01A.11D.A29H.01
##   ... TCGA.OR.A5LE.01A.11D.A29H.01 TCGA.OR.A5LL.01A.11D.A29H.01
## colData names(0):
res <- cnvEQTL(cnvrs, grl, rse,  min.samples=1,window = "1Mbp", verbose = TRUE)
## Restricting analysis to 10 intersecting samples
## Preprocessing RNA-seq data ...
## Summarizing per-sample CN state in each CNV region
## Excluding 45 cnvrs not satisfying min.samples threshold
## Analyzing 35 regions with >=1 gene in the given window
## 1 of 35
## ...
## 35 of 35
#The resulting GRangesList contains an entry for each CNV region tested, storing the genes tested in the genomic window around the CNV region, 
#and (i) log2 fold change with respect to the 2n group, (ii) edgeR's DE p-value, and (iii) the (per default) Benjamini-Hochberg adjusted p-value.

#We can illustrate differential expression of genes in the neighborhood of a CNV region of interest using the function plotEQTL.
#The following regions can be plotted: 1,2,3,4,8,9,13,16,23,34,35
res[2]
## GRangesList object of length 1:
## $`chr1:8004404-8026309`
## GRanges object with 1 range and 4 metadata columns:
##         seqnames          ranges strand |  logFC.CN1 logFC.CN3    PValue
##            <Rle>       <IRanges>  <Rle> |  <numeric> <numeric> <numeric>
##   PARK7     chr1 7954291-7985505      * | -0.0526768        NA  0.237278
##         AdjPValue
##         <numeric>
##   PARK7   0.37569
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
r <- GRanges(names(res)[2])
plotEQTL(cnvr=r, genes=res[[2]], genome="hg19", cn="CN1") 

###########################################CORRELATION OF RAW mRNA-Seq and GISTIC CNV DATA ACROSS ALL PATIENTS##########################

mRNA_expr<-miniACC.assays.comp.age.cnvcalls.ranges[[2]]
cnv_gistic<-miniACC.assays.comp.age.cnvcalls.ranges[[3]]
cnv_gistic_assay<-assay(cnv_gistic)
mRNA_expr_assay<-assay(mRNA_expr)
colnames(cnv_gistic_assay)==colnames(mRNA_expr_assay)  
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
colnames(cnv_gistic_assay)<-colnames(mRNA_expr_assay)
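# Overwriting the column names as above is only safe if both assays list the patients in the same order.
# A small check can be sketched with the first 12 characters of a TCGA barcode, which identify the patient.
# patient_id() is a hypothetical helper, and the barcodes in expr_cols are illustrative (only the CNV-style
# barcodes appear in this analysis).
patient_id <- function(barcode) substr(barcode, 1, 12)
expr_cols <- c("TCGA.OR.A5J9.01A.11R.A29S.07",
               "TCGA.OR.A5JE.01A.11R.A29S.07",
               "TCGA.OR.A5JF.01A.11R.A29S.07")
cnv_cols <- c("TCGA.OR.A5JF.01A.11D.A29H.01",
              "TCGA.OR.A5J9.01A.11D.A29H.01",
              "TCGA.OR.A5JE.01A.11D.A29H.01")
# Position of each expression sample's patient among the CNV samples
idx <- match(patient_id(expr_cols), patient_id(cnv_cols))
# Stop early if any patient is missing from the CNV assay
stopifnot(!anyNA(idx))
# Reordering the CNV columns with idx aligns both assays patient by patient
cnv_cols_aligned <- cnv_cols[idx]
cnv_cols_aligned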
# Let's correlate first gene (first row):
plot(log2(mRNA_expr_assay[1,]),cnv_gistic_assay[1,])   

cor.test(log2(mRNA_expr_assay[1,]),cnv_gistic_assay[1,], method="spearman") 
## Warning in cor.test.default(log2(mRNA_expr_assay[1, ]), cnv_gistic_assay[1, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  log2(mRNA_expr_assay[1, ]) and cnv_gistic_assay[1, ]
## S = 59, p-value = 0.05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##  rho 
## 0.64
#Spearman's rho of 0.6396021 is lower than the correlation coefficient reported by Firebrowse (0.8455)

mRNA_expr_assay_m <- as.matrix(mRNA_expr_assay[,1:10])
cnv_gistic_assay_m <- as.matrix(cnv_gistic_assay[,1:10])
rownames(mRNA_expr_assay_m)<-rownames(mRNA_expr_assay) 
rownames(cnv_gistic_assay_m)<-rownames(cnv_gistic_assay)  

#Determining how data are distributed for first gene (Should be matrix?)
hist(as.numeric(mRNA_expr_assay[1,1:10]))

hist(as.numeric(cnv_gistic_assay[1,1:10]))

#Determining how many other genes are strongly correlated between mRNA and CN assay omic data sets:
cors <- diag(cor(t(mRNA_expr_assay_m),t(cnv_gistic_assay_m),method="pearson"))
cors.sign <- cors[abs(cors)>0.67 & !is.na(cors)]
cors.sign #12 genes
##  [1] -0.793  0.782 -0.694  0.719  0.707  0.733  0.674  0.694  0.714  0.705
## [11]  0.819  0.691
cor_set<-mRNA_expr_assay_m[c(42,68,69,98,112,114,116,138,145,158,184,190),]
rownames(cor_set)
##  [1] "ATM"    "ACVRL1" "TSC1"   "GSK3A"  "KEAP1"  "XRCC1"  "NFKB1"  "NF2"   
##  [9] "MYH9"   "YWHAB"  "MSH2"   "DIABLO"
#These 12 genes are strongly correlated (|r| > 0.67) between the mRNA and CN assay omic data sets:
#The RAW, UNFILTERED, NON-LOG-TRANSFORMED, NON-NORMALIZED mRNA-seq expression levels and GISTIC CNV copy numbers of the 12 genes 
#"ATM" "ACVRL1" "TSC1"   "GSK3A"  "KEAP1"  "XRCC1"  "NFKB1"  "NF2"    "MYH9"   "YWHAB"  "MSH2"   "DIABLO" are strongly correlated across ALL patients
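# diag(cor(t(A), t(B))) computes every gene-by-gene correlation and keeps only the matched pairs. A sketch that
# computes the matched pairs directly and selects the strong ones by name instead of by hard-coded row numbers
# follows; expr_demo and cnv_demo are invented toy matrices sharing the same row names.
expr_demo <- rbind(geneA = c(1, 2, 3, 4, 5),
                   geneB = c(5, 3, 4, 1, 2),
                   geneC = c(2, 2, 2, 2, 2))
cnv_demo <- rbind(geneA = c(-1, 0, 0, 1, 2),
                  geneB = c(0, 0, 1, 1, 2),
                  geneC = c(0, 1, 0, 1, 0))
stopifnot(identical(rownames(expr_demo), rownames(cnv_demo)))
# One Pearson correlation per gene; a constant row (geneC) yields NA, so its warning is silenced
paired_cor <- vapply(rownames(expr_demo), function(g) {
  suppressWarnings(cor(expr_demo[g, ], cnv_demo[g, ], method = "pearson"))
}, numeric(1))
# which() drops NA values, so no separate !is.na() filter is needed
strong_genes <- names(which(abs(paired_cor) > 0.67))
strong_genes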

#Plotting these 12 genes (row positions are recomputed from cors so that both matrices are indexed consistently)
op <- par(mfrow=c(2,2))
sig.idx <- which(abs(cors)>0.67 & !is.na(cors))
for (i in sig.idx){
  gene <- rownames(mRNA_expr_assay_m)[i]
  x <- as.numeric(mRNA_expr_assay_m[i,])
  y <- as.numeric(cnv_gistic_assay_m[i,])
  plot(x,y, main=gene, xlab="mRNA expression", ylab="GISTIC CNV", cex.main=0.8)
  #least-squares trend line
  abline(lm(y ~ x), col="chartreuse3") 
}
#restore the previous plotting layout
par(op)

#Because the miRNA data set has more feature rows than the CN and mRNA data sets (which share identical gene rows), it is not used for this gene-wise correlation analysis

###############################CORRELATION BETWEEN RAW mRNA-Seq and GISTIC CNV WITHIN THE YOUNG AND OLD PATIENT GROUPS######
#INDEXES:
#YOUNG PATIENTS=1,2,4,6,9
#OLD PATIENTS=3,5,7,8,10

#YOUNG PATIENTS
#Determining how data are distributed for first gene for young patients(Should be matrix?)
hist(as.numeric(mRNA_expr_assay[1,c(1,2,4,6,9)]))
hist(as.numeric(cnv_gistic_assay[1,c(1,2,4,6,9)]))

#Determining how many other genes are strongly correlated between mRNA and CNV assay omic data sets across YOUNG PATIENTS:
cors.young <- diag(cor(t(mRNA_expr_assay_m[,c(1,2,4,6,9)]),t(cnv_gistic_assay_m[,c(1,2,4,6,9)]),method="pearson"))
## Warning in cor(t(mRNA_expr_assay_m[, c(1, 2, 4, 6, 9)]), t(cnv_gistic_assay_m[,
## : the standard deviation is zero
cors.young.sign <- cors.young[abs(cors.young)>0.67 & !is.na(cors.young)]
cors.young.sign 
##  [1]  0.698  0.970  0.725  0.978  0.828  0.727 -0.685 -0.902  0.705 -0.720
## [11]  0.939 -0.986 -0.795  0.859 -0.931  0.757  0.811  0.923  0.933  0.696
## [21]  0.985 -0.799  0.751  0.802  0.860  0.704  0.680  0.685  0.919 -0.983
## [31]  0.697 -0.773  0.807  0.905  0.695  0.680  0.959  0.909  0.805  0.816
## [41]  0.785  0.691  0.963  0.731  0.686  0.872  0.811  0.697  0.746  0.843
length(cors.young.sign) #50 genes
## [1] 50
#cor_set_young<-mRNA_expr_assay_m[c(),]
#rownames(cor_set_young)

#The RAW, UNFILTERED, NON-NORMALIZED, NON-LOG-TRANSFORMED mRNA-seq expression levels and GISTIC CNV copy numbers of 50 genes are 
#strongly correlated (|r| > 0.67) across the 5 selected young patients

#OLD PATIENTS
#Determining how data are distributed for first gene for old patients(Should be matrix?)
hist(as.numeric(mRNA_expr_assay[1,c(3,5,7,8,10)]))
hist(as.numeric(cnv_gistic_assay[1,c(3,5,7,8,10)]))

#Determining how many other genes are strongly correlated between mRNA and CN assay omic data sets across OLD PATIENTS:
cors.old <- diag(cor(t(mRNA_expr_assay_m[,c(3,5,7,8,10)]),t(cnv_gistic_assay_m[,c(3,5,7,8,10)]),method="pearson"))

cors.old.sign <- cors.old[abs(cors.old)>0.67 & !is.na(cors.old)]
cors.old.sign 
##  [1]  0.803 -0.913  0.713  0.794  0.717  0.813 -0.823 -0.707  0.979  0.786
## [11]  0.687  0.819 -0.712 -0.795  0.817 -0.846  0.980 -0.767  0.888 -0.909
## [21]  0.828 -0.785  0.784  0.763  0.739  0.673  0.739  0.696  0.923  0.993
## [31] -0.680  0.769  0.800  0.772  0.670  0.686  0.802  0.833  0.813  0.816
## [41] -0.688  0.860 -0.914  0.891
length(cors.old.sign) # 44 genes
## [1] 44
#cor_set_old<-mRNA_expr_assay_m[c(),]
#rownames(cor_set_old)
#The mRNA-seq RAW, UNFILTERED, NON-LOG TRANSFORMED, NON-NORMALIZED expression levels and GISTIC CNV copy numbers of 44 genes are
#strongly correlated (|r| > 0.67) across the 5 selected OLD patients. The cutoff is descriptive; no p-values were computed here.
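
With only five patients per group, it helps to know how large a Pearson coefficient must be before it reaches nominal significance. The sketch below derives that value from the t distribution and shows cor.test() on toy vectors. The helper name critical_r and the toy data are illustrative assumptions, not part of the original analysis.

```r
# Smallest |r| reaching two-sided significance at level alpha for n pairs,
# obtained by inverting t = r * sqrt((n - 2) / (1 - r^2)).
critical_r <- function(n, alpha = 0.05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}

r_crit_5 <- critical_r(5)  # five patients per age group
r_crit_5

# cor.test() returns the per-gene p-value directly (toy vectors, not ACC data).
x  <- c(2.1, 3.4, 1.9, 4.2, 3.0)
y  <- c(0, 1, 0, 1, 1)
ct <- cor.test(x, y, method = "pearson")
c(r = unname(ct$estimate), p = ct$p.value)
```

For n = 5 the critical value is about 0.88, so the 0.67 cutoff selects strong correlations without implying statistical significance.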

########################################MFA ON FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED mRNA-SEQ AND GISTIC CNV DATA##############################

# GISTIC CNV
countsFInfo_CNV_backup_MFA<-countsFInfo_CNV_backup[,2:11]
# transpose
countsFInfo_CNV_backup_MFA.t<-t(countsFInfo_CNV_backup_MFA)
# assign names, we include a suffix to differentiate genes from expression
colnames(countsFInfo_CNV_backup_MFA.t)<-paste(countsFInfo_CNV_backup$ID,"cnv",sep=".")
#mRNA Expression
countsF_TPM_LOG_DF_MFA <- countsF_TPM_LOG_DF[,1:10]
colnames(countsF_TPM_LOG_DF_MFA) <- colnames(countsFInfo_CNV_backup_MFA) #To perform later MFA, we need to have the same names
 
# transpose
countsF_TPM_LOG_DF_MFA.t<-t(countsF_TPM_LOG_DF_MFA)
# assign names, we include a suffix to differentiate genes from cnv
colnames(countsF_TPM_LOG_DF_MFA.t)<-paste(countsF_TPM_LOG_DF$ID,"mRNAexp",sep=".")

#miRNA Expression
countsF_TPM_LOG_DF_micro_MFA<-countsF_TPM_LOG_DF_micro[,1:10]
colnames(countsF_TPM_LOG_DF_micro_MFA) <- colnames(countsFInfo_CNV_backup_MFA)
 
# transpose
countsF_TPM_LOG_DF_micro_MFA.t<-t(countsF_TPM_LOG_DF_micro_MFA)
# Assign names, we include a suffix to differentiate genes from cnv
colnames(countsF_TPM_LOG_DF_micro_MFA.t)<-paste(countsF_TPM_LOG_DF_micro$ID,"miRNAexp",sep=".")

mRNAexp.l<-nrow(countsF_TPM_LOG_DF_MFA )
cnv.l<-nrow(countsFInfo_CNV_backup_MFA )
dat4Facto<-data.frame(cond=as.factor(cond2),countsF_TPM_LOG_DF_MFA.t,countsFInfo_CNV_backup_MFA.t) 
dim(dat4Facto)
## [1]  10 379
es = MFA(dat4Facto, group=c(1,mRNAexp.l,cnv.l), type=c("n",rep("c",2)), ncp=5, name.group=c("cond2","mRNAexp","cnv"),num.group.sup=c(1)) 

#top positively correlated variables with first dimension (in this run all of them come from the CN block)
top10.1 <- sort(es$global.pca$var$cor[,"Dim.1"],decreasing=TRUE)[1:10]
top10.1
## CDKN1B.cnv  ERBB3.cnv ACVRL1.cnv RICTOR.cnv  ACACB.cnv  GAPDH.cnv TUBA1B.cnv 
##      0.964      0.964      0.964      0.964      0.964      0.964      0.964 
##   KRT5.cnv   KRAS.cnv  FOXM1.cnv 
##      0.964      0.964      0.964
#top positively correlated variables with second dimension (in this run a mix of the expression and CN blocks)
top10.2 <- sort(es$global.pca$var$cor[,"Dim.2"],decreasing=TRUE)[1:10]
top10.2
##  PRKCA.mRNAexp SQSTM1.mRNAexp      YWHAE.cnv PIK3R1.mRNAexp    SRC.mRNAexp 
##          0.894          0.886          0.878          0.851          0.811 
##      MAPK3.cnv      PRRT2.cnv      EEF2K.cnv      MYH11.cnv   AKT3.mRNAexp 
##          0.796          0.796          0.796          0.796          0.793
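
Sorting in decreasing order only surfaces positive correlations. A minimal sketch of ranking by absolute correlation and tabulating the block of origin is given below. It assumes the variable names end in a block suffix such as ".cnv" or ".mRNAexp", as constructed above. The toy vector and the helper name top_by_abs are hypothetical.

```r
# Toy named vector mimicking es$global.pca$var$cor[, "Dim.1"].
dim_cor <- c(GENE1.cnv = 0.96, GENE2.mRNAexp = -0.98, GENE3.mRNAexp = 0.40,
             GENE4.cnv = -0.10, GENE5.cnv = 0.91)

# Keep the k variables with the largest absolute correlation, sign preserved.
top_by_abs <- function(cors, k = 10) {
  k <- min(k, length(cors))
  cors[order(abs(cors), decreasing = TRUE)][seq_len(k)]
}

top3 <- top_by_abs(dim_cor, k = 3)
top3

# Block of origin = text after the last dot in each variable name.
block_counts <- table(sub(".*\\.", "", names(top3)))
block_counts
```

Tabulating the suffixes gives a quick check of which data block dominates each dimension before writing it into a comment.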

miRNA-Seq AND mRNA-Seq DATA BLOCK CORRELATION ANALYSIS

#Correlations between the significant miRNAs and their significant targets obtained by TargetScan will be evaluated.
#Correlations are measured, and plots can also be written to the hard disk. In general, the pairs of interest are the inversely
#correlated miRNAs and genes with a correlation Rho < -0.5 (or beyond the 0.67 cutoff used in the previous section)

x_rna_backup<-x_rna
x_rna_backup<-as.matrix(x_rna_backup)
x_micro_backup<-x_micro
x_micro_backup<-as.matrix(x_micro_backup)
colnames(x_rna_backup)<-colnames(x_micro_backup)

mRNA.res2<-assay(mACC.exp3)
mRNA.res2<-as.data.frame(mRNA.res2)
mRNA.res2$Symbol<-rownames(mRNA.res2)
miRNA.res.hsa2<-assay(mACC.mir3)
miRNA.res.hsa2<-as.data.frame(miRNA.res.hsa2)
miRNA.res.hsa2$miRNA<-rownames(miRNA.res.hsa2)

#Correlations between the significant miRNAs and their significant targets obtained by TargetScan.
#Spearman correlations are measured; the code that writes dot plots with regression lines to the hard disk is currently commented out.
#The p-values are then adjusted with p.adjust(), whose default method is Holm rather than FDR. To obtain more results, pairs are kept
#on an unadjusted p-value < 0.05 alone; the filter on inversely correlated pairs (Rho < -0.5) is not applied in the code below.

resultsComb<-"./ResultsComb"
if(!dir.exists(resultsComb)) dir.create(resultsComb)
#cols<-as.vector(car::recode(pData(my.targets)$Cond,"'chord' ='green';'notochord' ='blue';"))     
#pchs<-as.vector(car::recode(pData(my.targets)$Cond, "'chord' =16;'notochord' =17;"))     

miRNAs<-miRNA.res.hsa2$miRNA
mRNAs<-mRNA.res2$Symbol

miRNACorrel<-function(res.miRNA,res.mRNA,data.miRNA,data.mRNA,resultsDir){
  #Function that looks up the targets of each miRNA in a list and correlates them with that miRNA
  #(the code that writes a pdf with regression lines and a summary csv is commented out)
  #needs function miRNAGenes, defined previously
  miRNAs<-res.miRNA$miRNA 
  mRNAs<-res.mRNA$Symbol
  
  for (i in miRNAs){
    miRNA.genes<-miRNAGenes(i)
    miRNA.genes.deg<-intersect(miRNA.genes,mRNAs)
    #correlations 
    lng<-length(miRNA.genes.deg)
    if (lng>0){
      cor.rho<-array(NA,lng)
      cor.pval<-array(NA,lng)
      miRNA.id<-rownames(res.miRNA[res.miRNA$miRNA==i,])
      y=as.vector(data.miRNA[miRNA.id,])
      
      #pdf(file.path(resultsComb, paste0(miRNA.id,".corr.mRNA.miRNA.pdf")))
      for (j in 1:lng){
        mRNA<-miRNA.genes.deg[j]
        mRNA.id<-rownames(res.mRNA[!is.na(res.mRNA$Symbol) & res.mRNA$Symbol==mRNA,])[1]  
        x=as.vector(data.mRNA[mRNA.id,])
        cor<-cor.test(x,y, method = "spearman",exact=FALSE)
        cor.pval[j]<-cor$p.value
        cor.rho[j]<-cor$estimate 
        #we will plot just those combinations having a p.value<0.05 and a regression coef above 0.5 (positive or negative)
        #if (cor$p.value < 0.05 & cor$estimate<(-0.5)){ 
        #plot(x, y, main=mRNA,
        #xlab="log2RMA expression",
        #ylab="log2miRMA expression",
        #type="p",
        #xlim=c(0,16),
        #ylim=c(0,16),
        #col=cols,
        #pch=pchs,
        #cex=0.8)
        #fit <- lm(y ~ x)
        #abline(fit, col="chartreuse3",xlim=c(0,16)) 
        #} 
      }
      #dev.off()  #close pdf file
      cor.table<-data.frame("miRNA ID"=rep(miRNA.id,lng),
                            "miRNA"=rep(i,lng),
                            miRNA.genes.deg,
                            "Rho"=as.vector(cor.rho),
                            "pval"=as.vector(cor.pval),
                            "adj.pval"=p.adjust(cor.pval)) #p.adjust() defaults to the Holm method
      cor.table.f<-cor.table[cor.table$pval<0.05,] #just a soft threshold on the unadjusted p-value
      
      #write.csv2(cor.table.f,
      #file=file.path(resultsDir,paste(miRNA.id,"csv",sep=".")))
    }
  }  
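  #cor.table.f is overwritten on every iteration, so only the table of the last miRNA with targets is returned
  return(cor.table.f)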
}    

cor.table.f.returned<-miRNACorrel(res.miRNA=miRNA.res.hsa2,res.mRNA=mRNA.res2,data.miRNA= x_micro_backup,data.mRNA= x_rna_backup,resultsDir=resultsComb)
cor.table.f.returned
##      miRNA.ID      miRNA miRNA.genes.deg   Rho   pval adj.pval
## 14 hsa-let-7i hsa-let-7i           CASP3 0.636 0.0479    0.671
## 15 hsa-let-7i hsa-let-7i            GAB2 0.661 0.0376    0.564

CIRCOS PLOT DEPICTING PREVIOUSLY OBTAINED CORRELATION COEFFICIENTS ALONG WITH FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED mRNA-SEQ COUNTS, miRNA-SEQ COUNTS, AND ENCODED GISTIC CNV VALUES FOR SEPARATE OLD AND YOUNG PATIENT GROUPS:

options(stringsAsFactors = FALSE) 

#RECALL FILTERED, NORMALIZED, LOG-TRANSFORMED mRNA-SEQ MATRIXES AND CNV DATAFRAME:
countsF_TPM_LOG<-log2(countsTPM[,1:10]+2)
countsF_TPM_LOG_DF<-as.data.frame(countsF_TPM_LOG)
countsF_TPM_LOG_DF$ID<-countsFInfo_backup$ID
countsF_TPM_LOG_DF$chr<-countsFInfo_backup$chr
countsF_TPM_LOG_DF$start<-countsFInfo_backup$start
countsF_TPM_LOG_DF$end<-countsFInfo_backup$end
cors.young[is.na(cors.young)] <- 0

#miRNA Expression
countsF_TPM_LOG_micro<-log2(countsTPM_micro[,1:10]+2)
countsF_TPM_LOG_DF_micro<-as.data.frame(countsF_TPM_LOG_micro)
countsF_TPM_LOG_DF_micro$ID<-countsFInfo_micro$ID
countsF_TPM_LOG_DF_micro$chr<-countsFInfo_micro$chromosome_name
countsF_TPM_LOG_DF_micro$start<-countsFInfo_micro$start_position
countsF_TPM_LOG_DF_micro$end<-countsFInfo_micro$end_position

range(assays(mRNA_expr)$"exprs")
## [1]      0 206162
table(seqnames(rowRanges(mRNA_expr)))
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   22   11   13    7    9    5    9    8    9    8   10   11    2    2    7    7 
##   17   18   19   20   21   22    X    Y chrM 
##   13    4   16    8    2    6    6    0    0
rowRanges(mRNA_expr) 
## GRanges object with 195 ranges and 1 metadata column:
##          seqnames              ranges strand |     gene_id
##             <Rle>           <IRanges>  <Rle> | <character>
##   DIRAS3        1   68511645-68516481      - |        9077
##   MAPK14        6   35995454-36079013      + |        1432
##     YAP1       11 101981192-102104154      + |       10413
##   CDKN1B       12   12870302-12875305      + |        1027
##    ERBB2       17   37844393-37884915      + |        2064
##      ...      ...                 ...    ... .         ...
##    MACC1        7   20174279-20257013      - |      346389
##     CHGA       14   93389445-93401638      + |        1113
##    IDH3A       15   78441719-78462884      + |        3419
##   SQSTM1        5 179233388-179265077      + |        8878
##   KCNJ13        2 233630512-233641275      - |        3769
##   -------
##   seqinfo: 25 sequences (1 circular) from 2 genomes (GRCh37.p13, hg19)
#Already a GRanges Object (No need to unlist)
#mRNA_expr.gr<-unlist(rowRanges(mRNA_expr))#from a GRangesList to a GRanges object?  
range_df<-as.data.frame(rowRanges(mRNA_expr)) 
range_df$gene_symbol<-rownames(range_df)
 
T.cors.old<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),cors.old,row.names=NULL)
T.cors.young<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),cors.young,row.names=NULL)

T.CN.old<-data.frame("chr"=countsFInfo_CNV_backup$chr,"Start"=as.integer(countsFInfo_CNV_backup$start),"End"=as.integer(countsFInfo_CNV_backup$end),countsFInfo_CNV_backup[,c(4,6,8,9,11)],row.names=NULL)
T.CN.young<-data.frame("chr"=countsFInfo_CNV_backup$chr,"Start"=as.integer(countsFInfo_CNV_backup$start),"End"=as.integer(countsFInfo_CNV_backup$end),countsFInfo_CNV_backup[,c(2,3,5,7,10)],row.names=NULL)

T.mRNA.old<-data.frame("chr"=countsF_TPM_LOG_DF$chr,"Start"=as.integer(countsF_TPM_LOG_DF$start),"End"=as.integer(countsF_TPM_LOG_DF$end),countsF_TPM_LOG_DF[,c(3,5,7,8,10)],row.names=NULL)
T.mRNA.young<-data.frame("chr"=countsF_TPM_LOG_DF$chr,"Start"=as.integer(countsF_TPM_LOG_DF$start),"End"=as.integer(countsF_TPM_LOG_DF$end),countsF_TPM_LOG_DF[,c(1,2,4,6,9)],row.names=NULL)
T.miRNA.old<-data.frame("chr"=countsF_TPM_LOG_DF_micro$chr,"Start"=as.integer(countsF_TPM_LOG_DF_micro$start),"End"=as.integer(countsF_TPM_LOG_DF_micro$end),countsF_TPM_LOG_DF_micro[,c(3,5,7,8,10)],row.names=NULL)
T.miRNA.young<-data.frame("chr"=countsF_TPM_LOG_DF_micro$chr,"Start"=as.integer(countsF_TPM_LOG_DF_micro$start),"End"=as.integer(countsF_TPM_LOG_DF_micro$end),countsF_TPM_LOG_DF_micro[,c(1,2,4,6,9)],row.names=NULL)
T_labels<-data.frame("chr"=range_df$seqnames,"Start"=as.integer(range_df$start),"End"=as.integer(range_df$end),range_df$gene_symbol,row.names=NULL)
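
The track tables above repeat the same data.frame() pattern: chromosome, start, end, then the value columns. A small helper can make that layout explicit. This is a sketch on toy inputs; the helper name make_track and the toy tables are hypothetical, and it assumes the annotation columns are called chr, start, and end.

```r
# Toy annotation and value tables (hypothetical genes and patients).
info <- data.frame(ID = c("G1", "G2"), chr = c("1", "X"),
                   start = c(100, 2000), end = c(500, 2600))
vals <- data.frame(p1 = c(5.1, 3.2), p2 = c(4.8, 3.9))

# Build a mapping table with positions in columns 1-3 and values from column 4 on.
make_track <- function(info, values) {
  stopifnot(nrow(info) == nrow(values))
  data.frame(chr = info$chr,
             Start = as.integer(info$start),
             End = as.integer(info$end),
             values,
             row.names = NULL)
}

track <- make_track(info, vals)
track
```

Keeping the first value in column 4 matches the col.v=4 argument passed to the circos() calls in this section.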

#Circos plot of the FILTERED, TPM-NORMALIZED, LOG-TRANSFORMED DATA FOR THE YOUNG AND OLD PATIENT GROUPS
colors <- rainbow(10, alpha=0.5)
par(mar=c(2, 2, 2, 2))
plot(c(1,800), c(1,800), type="n", axes=FALSE, xlab="", ylab="", main="")
circos(R=300, cir="hg19", W=4, type="chr", print.chr.lab=TRUE, scale=TRUE)
circos(R=260, cir="hg19", W=40, mapping=T.miRNA.young,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=220, cir="hg19", W=40, mapping=T.miRNA.old,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=180, cir="hg19", W=40, mapping=T.mRNA.young,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=140, cir="hg19", W=40, mapping=T.mRNA.old,col.v=4,type="heatmap2", cluster=TRUE, col.bar=TRUE, lwd=0.1, col="blue")
circos(R=120, cir="hg19", W=20,  mapping=T.CN.young,   col.v=4,   type="ml3", B=FALSE, lwd=1, cutoff=0)
circos(R=100, cir="hg19", W=20,  mapping=T.CN.old,   col.v=4,   type="ml3", B=FALSE, lwd=1, cutoff=0)
circos(R=80, cir="hg19", W=20,  mapping=T.cors.young,  col.v=4, type="s",   B=TRUE, lwd=1, col=colors[1])
circos(R=60, cir="hg19", W=20,  mapping=T.cors.old,  col.v=4, type="s",   B=TRUE, lwd=1, col=colors[1])
#Adding labels for the genes
circos(R=310, cir="hg19", W=20, mapping=T_labels, type="label", side="out", col=c("black", "blue","red"), cex=0.4)

MULTI-FACTOR ANALYSIS (MFA)

##########################################GLOBAL MFA ON RAW CNV, mRNA-Seq, miRNA-Seq DATA####################################
cond<-as.factor(colData(miniACC.assays.comp.age)$years_to_birth)
dat4Facto<-data.frame(cond=cond,t(mACC.exp.c3),t(mACC.CN.c3),t(mACC.mir.c3)) 
rownames(dat4Facto) <- gsub("TCGA-","",rownames(cd3))
 
#We will consider CN as scaled but it would be better to consider it as categorical
res = MFA(dat4Facto, group=c(1,exp.l3,cn.l3,mir.l3), type=c("n","c","s","c"), ncp=5, name.group=c("cond","mRNA","CNV","miRNA"),num.group.sup=c(1)) 

#Extra informative plots
plot(res,choix="ind",habillage = "cond")

plotellipses(res, keepvar = "cond")

#There seems to be a clear separation between old and young patients.
#Patient samples OR-A5L5 and OR-A5LC appear to be outliers and will be replaced with different aged patients

########################################GLOBAL MFA ON FILTERED, NORMALIZED, LOG-TRANSFORMED CNV, mRNA-Seq, miRNA-Seq DATA#########################
mRNAexp.l<-nrow(countsF_TPM_LOG_DF_MFA)
cnv.l<-nrow(countsFInfo_CNV_backup_MFA)
miRNAexp.l<-nrow(countsF_TPM_LOG_DF_micro_MFA)

dat4Facto2<-data.frame(cond=as.factor(cond2),countsF_TPM_LOG_DF_MFA.t,countsFInfo_CNV_backup_MFA.t,countsF_TPM_LOG_DF_micro_MFA.t) 
dim(dat4Facto2)
## [1]  10 670
#We will consider CN as scaled but it would be better to consider it as categorical
es2 = MFA(dat4Facto2, group=c(1,mRNAexp.l,cnv.l,miRNAexp.l), type=c("n","c","s","c"), ncp=5,name.group=c("cond2","mRNAexp","cnv","miRNAexp"),num.group.sup=c(1)) 
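
MFA() assigns columns to groups purely by position, so the group sizes must add up to the number of columns and follow the column order of the data frame. A minimal sanity check is sketched below on a toy table. The helper name check_mfa_groups and the toy data are hypothetical.

```r
# Toy table: one condition column, two expression columns, two CN columns.
dat <- data.frame(cond = factor(c("young", "old", "young", "old")),
                  A.mRNAexp = c(1.2, 0.4, 2.2, 1.9),
                  B.mRNAexp = c(3.1, 2.8, 3.5, 2.9),
                  A.cnv = c(0, 1, -1, 0),
                  B.cnv = c(1, 1, 0, -1))

# Report the first and last column of each group after checking the totals.
check_mfa_groups <- function(dat, group, name.group) {
  stopifnot(length(group) == length(name.group),
            sum(group) == ncol(dat))
  ends   <- cumsum(group)
  starts <- ends - group + 1
  data.frame(group = name.group,
             first = colnames(dat)[starts],
             last  = colnames(dat)[ends])
}

blocks <- check_mfa_groups(dat, group = c(1, 2, 2),
                           name.group = c("cond", "mRNAexp", "cnv"))
blocks
```

Reading the first and last column name of each group confirms that the suffixes line up with name.group before the MFA is run.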

top10.1 <- sort(es2$global.pca$var$cor[,"Dim.1"],decreasing=TRUE)[1:10]
top10.1
##  SMAD1.mRNAexp    SRC.mRNAexp PIK3R1.mRNAexp PRKAA1.mRNAexp   AKT3.mRNAexp 
##          0.887          0.839          0.839          0.831          0.828 
##  NFKB1.mRNAexp  MAPK9.mRNAexp   AKT1.mRNAexp  PRKCA.mRNAexp SQSTM1.mRNAexp 
##          0.823          0.817          0.815          0.803          0.802
top10.2 <- sort(es2$global.pca$var$cor[,"Dim.2"],decreasing=TRUE)[1:10]
top10.2
##    SRC.cnv   TGM2.cnv   E2F1.cnv  NCOA3.cnv BCL2L1.cnv PRKAA1.cnv  YWHAB.cnv 
##      0.930      0.930      0.930      0.930      0.930      0.930      0.930 
##  PREX1.cnv CDKN1B.cnv  ERBB3.cnv 
##      0.930      0.926      0.926
top10.3 <- sort(es2$global.pca$var$cor[,"Dim.3"],decreasing=TRUE)[1:10]
top10.3
## hsa.mir.196a.2.miRNAexp   hsa.mir.106b.miRNAexp hsa.mir.196a.1.miRNAexp 
##                   0.886                   0.875                   0.864 
##     hsa.mir.25.miRNAexp            CDK1.mRNAexp   hsa.mir.16.2.miRNAexp 
##                   0.848                   0.793                   0.793 
##   hsa.mir.196b.miRNAexp  hsa.mir.92a.2.miRNAexp           FOXM1.mRNAexp 
##                   0.790                   0.785                   0.776 
##           ACACB.mRNAexp 
##                   0.775
#Extra informative plots
plot(es2,choix="ind",habillage = "cond")

plotellipses(es2, keepvar = "cond")

fviz_mfa_ind(es2, label = "var", habillage = cond2, addEllipses = TRUE, ellipse.level = 0.95) 

fviz_contrib(es2, choice = "quanti.var", axes = 1)

summary(es2)
## 
## Call:
## MFA(base = dat4Facto2, group = c(1, mRNAexp.l, cnv.l, miRNAexp.l),  
##      type = c("n", "c", "s", "c"), ncp = 5, name.group = c("cond2",  
##          "mRNAexp", "cnv", "miRNAexp"), num.group.sup = c(1)) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.137   1.634   1.236   1.070   0.781   0.656   0.488
## % of var.             24.655  18.854  14.268  12.347   9.014   7.569   5.630
## Cumulative % of var.  24.655  43.509  57.777  70.123  79.138  86.707  92.336
##                        Dim.8   Dim.9
## Variance               0.430   0.235
## % of var.              4.957   2.706
## Cumulative % of var.  97.294 100.000
## 
## Groups
##                                 Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## mRNAexp                      |  0.848 39.705  0.359 |  0.615 37.652  0.189 |
## cnv                          |  0.454 21.258  0.098 |  0.920 56.312  0.402 |
## miRNAexp                     |  0.834 39.036  0.590 |  0.099  6.035  0.008 |
##                               Dim.3    ctr   cos2  
## mRNAexp                       0.523 42.304  0.136 |
## cnv                           0.434 35.121  0.090 |
## miRNAexp                      0.279 22.575  0.066 |
## 
## Supplementary group
##                                Dim.1  cos2   Dim.2  cos2   Dim.3  cos2  
## cond2                        | 0.193 0.037 | 0.045 0.002 | 0.088 0.008 |
## 
## Individuals
##                                 Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## TCGA.OR.A5J9.01A.11D.A29H.01 |  1.261  7.437  0.197 |  1.240  9.404  0.191 |
## TCGA.OR.A5JE.01A.11D.A29H.01 | -1.916 17.186  0.417 |  0.533  1.737  0.032 |
## TCGA.OR.A5JF.01A.11D.A29H.01 |  1.507 10.635  0.381 |  0.803  3.946  0.108 |
## TCGA.OR.A5JI.01A.11D.A29H.01 |  1.077  5.430  0.157 |  1.270  9.868  0.218 |
## TCGA.OR.A5K0.01A.11D.A29H.01 |  0.705  2.325  0.046 | -2.017 24.901  0.374 |
## TCGA.OR.A5KV.01A.11D.A29H.01 | -1.398  9.145  0.248 | -0.491  1.474  0.031 |
## TCGA.OR.A5L5.01A.11D.A29H.01 |  0.439  0.902  0.038 |  0.808  3.996  0.127 |
## TCGA.OR.A5LC.01A.11D.A29H.01 | -1.286  7.745  0.158 |  1.172  8.405  0.131 |
## TCGA.OR.A5LE.01A.11D.A29H.01 | -2.231 23.300  0.512 | -1.198  8.788  0.148 |
## TCGA.OR.A5LL.01A.11D.A29H.01 |  1.843 15.895  0.274 | -2.119 27.479  0.363 |
##                               Dim.3    ctr   cos2  
## TCGA.OR.A5J9.01A.11D.A29H.01  1.105  9.873  0.152 |
## TCGA.OR.A5JE.01A.11D.A29H.01 -0.413  1.381  0.019 |
## TCGA.OR.A5JF.01A.11D.A29H.01  0.222  0.400  0.008 |
## TCGA.OR.A5JI.01A.11D.A29H.01 -1.051  8.941  0.149 |
## TCGA.OR.A5K0.01A.11D.A29H.01  1.211 11.853  0.135 |
## TCGA.OR.A5KV.01A.11D.A29H.01 -1.566 19.845  0.312 |
## TCGA.OR.A5L5.01A.11D.A29H.01 -1.357 14.894  0.359 |
## TCGA.OR.A5LC.01A.11D.A29H.01  1.959 31.026  0.367 |
## TCGA.OR.A5LE.01A.11D.A29H.01  0.274  0.606  0.008 |
## TCGA.OR.A5LL.01A.11D.A29H.01 -0.382  1.181  0.012 |
## 
## Continuous variables (the 10 first)
##                                 Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## DIRAS3.mRNAexp               | -0.309  0.069  0.017 |  1.859  3.251  0.604 |
## MAPK14.mRNAexp               |  0.438  0.138  0.294 | -0.309  0.090  0.147 |
## YAP1.mRNAexp                 |  0.468  0.158  0.286 | -0.597  0.335  0.465 |
## CDKN1B.mRNAexp               |  0.357  0.092  0.254 | -0.403  0.152  0.324 |
## ERBB2.mRNAexp                |  0.515  0.191  0.245 | -0.169  0.027  0.026 |
## G6PD.mRNAexp                 |  0.383  0.105  0.086 |  1.026  0.992  0.620 |
## KDR.mRNAexp                  |  0.557  0.223  0.110 |  1.281  1.543  0.579 |
## AKT1S1.mRNAexp               |  0.007  0.000  0.000 |  0.165  0.026  0.055 |
## MAPK8.mRNAexp                |  0.563  0.228  0.399 | -0.337  0.107  0.143 |
## PRKCD.mRNAexp                |  0.015  0.000  0.000 |  0.138  0.018  0.030 |
##                               Dim.3    ctr   cos2  
## DIRAS3.mRNAexp               -0.444  0.245  0.034 |
## MAPK14.mRNAexp                0.347  0.150  0.184 |
## YAP1.mRNAexp                 -0.265  0.087  0.091 |
## CDKN1B.mRNAexp                0.281  0.098  0.158 |
## ERBB2.mRNAexp                -0.230  0.066  0.049 |
## G6PD.mRNAexp                  0.147  0.027  0.013 |
## KDR.mRNAexp                  -0.194  0.047  0.013 |
## AKT1S1.mRNAexp                0.177  0.039  0.063 |
## MAPK8.mRNAexp                 0.144  0.026  0.026 |
## PRKCD.mRNAexp                 0.057  0.004  0.005 |
## 
## Supplementary categories
##                                 Dim.1   cos2 v.test    Dim.2   cos2 v.test  
## old                          |  0.642  0.434  1.317 | -0.271  0.077 -0.635 |
## young                        | -0.642  0.434 -1.317 |  0.271  0.077  0.635 |
##                               Dim.3   cos2 v.test  
## old                           0.330  0.115  0.892 |
## young                        -0.330  0.115 -0.892 |
#Overall, MFA helps to understand the underlying structure of the data by reducing its dimensionality and highlighting the relationships between variables and observations.
#Based on the MFA summary eigenvalues, the first three dimensions capture 57.78% of the total variance (24.66% (dim 1) + 18.85% (dim 2) + 14.27% (dim 3)).
#Based on the MFA summary group analysis, the mRNA-seq and miRNA-seq blocks contribute about equally, and the most, to dimension 1
#(contributions of 39.7% and 39.0% vs. 21.3% for GISTIC CNV), while GISTIC CNV contributes the most to
#dimension 2 (coordinate 0.920 vs. 0.615 for mRNA-seq and 0.099 for miRNA-seq).
#The variables most positively correlated with dimension 1 (all from the mRNA-seq data block) are
#SMAD1, SRC, PIK3R1, PRKAA1, AKT3, NFKB1, MAPK9, AKT1, PRKCA, and SQSTM1.
#The variables most positively correlated with dimension 2 (all from the GISTIC CNV gene-based recurrent lesions data block) are SRC,
#TGM2, E2F1, NCOA3, BCL2L1, PRKAA1, YWHAB, PREX1, CDKN1B, and ERBB3. Those for dimension 3 are (from the miRNA-seq data block)
#hsa.mir.196a.2, hsa.mir.106b, hsa.mir.196a.1, hsa.mir.25, hsa.mir.16.2, hsa.mir.196b, hsa.mir.92a.2,
#and (from the mRNA-seq data block) CDK1, FOXM1, and ACACB.
#The leading variables of dimensions 1, 2, and 3 thus come mainly from the mRNA, CNV, and miRNA blocks, respectively, so the three data blocks separate clearly.
#Based on the individuals analysis, which gives the position of each of the ten patients on each dimension,
#no clear segregation between young and old patient samples is apparent (the supplementary categories have |v.test| = 1.32 on dimension 1, below the usual 1.96 cutoff).
#Of the ten selected patient samples, A5J9 (young), A5JF (old), A5JI (young), A5K0 (old), A5L5 (old), and A5LL (old) have positive coordinates on dimension 1, while
#A5JE (young), A5KV (young), A5LC (old), and A5LE (young) have negative coordinates on dimension 1.
#Young patients A5LE, A5J9, and A5JE appear to be outliers. Old patients A5K0, A5LL, A5JF, and A5LC appear to be outliers, suggesting that the
#10 patients selected were not appropriate. The mRNA expression dimension seems to coincide with the age.status condition
#more than the other 2 data blocks.
#Based on the MFA continuous variables analysis, which relates the original variables to the extracted dimensions,
#the mRNA-seq data block genes strongly influence dimension 1; the summary lists only the first ten variables, all of which belong to the mRNA-seq block.
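
The percentages and group positions quoted above can be re-derived from the rounded numbers printed in the MFA summary. The sketch below copies the eigenvalues and the Dim.1 coordinates from that output and assumes the patients appear in the same order as the YOUNG/OLD index lists given earlier. The variable names are illustrative.

```r
# Eigenvalues as printed (rounded) in the MFA summary.
eig <- c(2.137, 1.634, 1.236, 1.070, 0.781, 0.656, 0.488, 0.430, 0.235)
pct <- 100 * eig / sum(eig)
cum3 <- cumsum(pct)[3]  # share of variance captured by the first three dimensions
round(cum3, 2)

# Dim.1 coordinates of the ten individuals, in summary order.
dim1 <- c(A5J9 = 1.261, A5JE = -1.916, A5JF = 1.507, A5JI = 1.077, A5K0 = 0.705,
          A5KV = -1.398, A5L5 = 0.439, A5LC = -1.286, A5LE = -2.231, A5LL = 1.843)
young_idx <- c(1, 2, 4, 6, 9)  # assumed to match the index lists given earlier
age_group <- ifelse(seq_along(dim1) %in% young_idx, "young", "old")

# Group means of the coordinates (barycentres) on dimension 1.
bary <- tapply(dim1, age_group, mean)
round(bary, 3)
```

The group means agree with the supplementary category coordinates in the summary (0.642 for old, -0.642 for young) up to rounding, which supports the sample-to-group assignment used in the interpretation.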